Hi Quincey,

Thanks for the info. I could not find the default value for H5Pset_istore_k; is it documented anywhere? What value would you recommend for our case, with many small datasets?
Cheers,
Andy

Quincey Koziol wrote on 2010-08-31:
> Hi Andy,
>
> On Aug 30, 2010, at 9:06 PM, Salnikov, Andrei A. wrote:
>
>> Hi HDF5 gurus,
>>
>> We are seeing some unexpected behavior creating HDF5 files
>> which we need to understand. Our data sets have wildly varying
>> sizes and complexity. In one extreme case we need to write
>> many (thousands) relatively small datasets in one HDF5 file.
>> Because the size of the datasets and their number is not known
>> in advance we prefer to use chunked storage for the data.
>> What we see is that when the number of chunked datasets is large
>> then the size of the HDF5 file becomes much larger than the
>> volume of the data stored. It seems that there is a significant
>> overhead per dataset for chunked data.
>>
>> I tried to understand where this overhead comes from. Running
>> h5stat on several files I see that the B-tree for chunked
>> datasets in our case takes more space than even the data
>> itself. I collected the dependency of the B-tree size on the
>> number of chunked datasets in a file. It looks like the size
>> of the tree grows linearly with the number of chunked datasets
>> (at least in our case, where there is just one chunk per dataset).
>> It seems that 2096 bytes of B-tree space are allocated for
>> every chunked dataset. With a large number of datasets (tens of
>> thousands) the overhead becomes very big in our case. Below is
>> an example of h5stat output for one of our problematic files
>> if anybody is interested in looking at it.
>>
>> Is my analysis of B-tree size growth correct?
>
> Yes, I think your analysis is correct. Currently, there's at least
> one B-tree node per chunked dataset (that will be changing with the 1.10.0
> release, when it's finished).
>
>> Is there a way to reduce the size?
>
> You should be able to use the H5Pset_istore_k() API routine
> (http://www.hdfgroup.org/HDF5/doc/RM/RM_H5P.html#Property-SetIstoreK) to
> reduce the B-tree fanout value.
>
> Quincey
>
>> Cheers,
>> Andy
>>
>> ================================================================
>>
>> File information
>>     # of unique groups: 31107
>>     # of unique datasets: 58767
>>     # of unique named datatypes: 0
>>     # of unique links: 0
>>     # of unique other: 0
>>     Max. # of links to object: 1
>>     Max. # of objects in group: 201
>> Object header size: (total/unused)
>>     Groups: 1311904/0
>>     Datasets: 27052128/16288
>>     Datatypes: 0/0
>> Storage information:
>>     Groups:
>>         B-tree/List: 28798864
>>         Heap: 4964928
>>     Attributes:
>>         B-tree/List: 0
>>         Heap: 0
>>     Chunked datasets:
>>         B-tree: 115116512
>>     Shared Messages:
>>         Header: 0
>>         B-tree/List: 0
>>         Heap: 0
>>     Superblock extension: 0
>> Small groups:
>>     # of groups of size 1: 1222
>>     # of groups of size 2: 27663
>>     # of groups of size 3: 1211
>>     # of groups of size 4: 203
>>     # of groups of size 5: 403
>>     Total # of small groups: 30702
>> Group bins:
>>     # of groups of size 1 - 9: 30702
>>     # of groups of size 10 - 99: 202
>>     # of groups of size 100 - 999: 203
>>     Total # of groups: 31107
>> Dataset dimension information:
>>     Max. rank of datasets: 1
>>     Dataset ranks:
>>         # of dataset with rank 0: 2230
>>         # of dataset with rank 1: 56537
>> 1-D Dataset information:
>>     Max. dimension size of 1-D datasets: 600
>>     Small 1-D datasets:
>>         # of dataset dimensions of size 1: 1461
>>         # of dataset dimensions of size 4: 202
>>         # of dataset dimensions of size 9: 202
>>         Total small datasets: 1865
>>     1-D Dataset dimension bins:
>>         # of datasets of size 1 - 9: 1865
>>         # of datasets of size 10 - 99: 49044
>>         # of datasets of size 100 - 999: 5628
>>         Total # of datasets: 56537
>> Dataset storage information:
>>     Total raw data size: 74331886
>> Dataset layout information:
>>     Dataset layout counts[COMPACT]: 0
>>     Dataset layout counts[CONTIG]: 3845
>>     Dataset layout counts[CHUNKED]: 54922
>>     Number of external files : 0
>> Dataset filters information:
>>     Number of datasets with:
>>         NO filter: 3845
>>         GZIP filter: 54922
>>         SHUFFLE filter: 0
>>         FLETCHER32 filter: 0
>>         SZIP filter: 0
>>         NBIT filter: 0
>>         SCALEOFFSET filter: 0
>>         USER-DEFINED filter: 0
>>
>>
>> _______________________________________________
>> Hdf-forum is for HDF software users discussion.
>> [email protected]
>> http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org
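
A note on the numbers in this thread: the 2096 bytes per chunked dataset can be cross-checked against the h5stat output above, and it appears consistent with one full-size B-tree node per dataset. The sketch below is only an illustration under stated assumptions. The node layout used in the hypothetical helper chunk_btree_node_size (a 24-byte header, 2K child addresses, and 2K+1 keys, each key holding a 4-byte chunk size, a 4-byte filter mask and rank+1 8-byte offsets) is taken from a reading of the version 1 B-tree description in the HDF5 file format specification, 8-byte file addresses are assumed, and K=32 is assumed to be the library default for the chunk B-tree. None of this comes from the thread itself.

```python
# Figures copied from the h5stat output quoted in this thread.
CHUNK_BTREE_BYTES = 115116512   # "Chunked datasets: B-tree"
CHUNKED_DATASETS = 54922        # "Dataset layout counts[CHUNKED]"
RAW_DATA_BYTES = 74331886       # "Total raw data size"


def chunk_btree_node_size(k, rank, addr_size=8):
    """Estimated size in bytes of one chunk B-tree node (hypothetical helper).

    Assumed layout (version 1 B-tree, not confirmed in the thread):
      header: 4-byte signature, 1-byte type, 1-byte level,
              2-byte entry count, two sibling addresses
      body:   2*k child addresses and 2*k + 1 keys
      key:    4-byte chunk size, 4-byte filter mask,
              (rank + 1) offsets of 8 bytes each
    """
    header = 4 + 1 + 1 + 2 + 2 * addr_size
    key = 4 + 4 + 8 * (rank + 1)
    return header + 2 * k * addr_size + (2 * k + 1) * key


# Observed overhead per chunked dataset: quotient and remainder.
print(divmod(CHUNK_BTREE_BYTES, CHUNKED_DATASETS))    # (2096, 0)

# B-tree space relative to the raw data volume.
print(f"{CHUNK_BTREE_BYTES / RAW_DATA_BYTES:.2f}")    # 1.55

# Estimated node size for rank-1 datasets at several fanout values.
for k in (32, 4, 1):
    print(k, chunk_btree_node_size(k, rank=1))
```

Under these assumptions K=32 gives exactly the observed 2096 bytes, while K=1 would give 112 bytes per rank-1 dataset. In the C API the fanout is set with H5Pset_istore_k() on a file creation property list before the file is created, so it applies to every chunked dataset in that file. A very small value suits files where each dataset has only one or a few chunks, but it would probably make the trees deeper for datasets with many chunks, so it is worth measuring with h5stat on a representative file.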
