Hi Andy,
On Aug 30, 2010, at 9:06 PM, Salnikov, Andrei A. wrote:
> Hi HDF5 gurus,
>
> We are seeing some unexpected behavior creating HDF5 files
> which we need to understand. Our data sets have wildly varying
> sizes and complexity. In one extreme case we need to write
> many (thousands) relatively small datasets in one HDF5 file.
> Because the size of the datasets and their number are not known
> in advance we prefer to use chunked storage for the data.
> What we see is that when the number of chunked datasets is large
> then the size of the HDF5 file becomes much larger than the
> volume of the data stored. It seems that there is a significant
> overhead per dataset for chunked data.
>
> I tried to understand where this overhead comes from. Running
> h5stat on several files I see that the size of the B-Tree for
> chunked datasets in our case takes more space than even data
> itself. I collected the dependency of the B-Tree size on the
> number of the chunked datasets in a file. It looks like the size
> of the tree grows linearly with the number of the chunked datasets
> (at least in our case when there is just one chunk per dataset).
> It seems that there are 2096 bytes of B-Tree space allocated per
> every chunked dataset. With a large number of datasets (tens of
> thousands) the overhead becomes very big in our case. Below is
> an example of h5stat output for one of our problematic files
> in case anybody is interested in looking at it.
>
> Is my analysis of B-Tree size growth correct?
Yes, I think your analysis is correct. Currently, there's at least one
B-tree node per chunked dataset (that will be changing with the 1.10.0 release,
when it's finished).
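
As a quick sanity check of the 2096-byte figure, the arithmetic can be redone
from the h5stat output quoted below (B-tree: 115116512 bytes, CHUNKED layout
count: 54922, total raw data size: 74331886 bytes). This is only a sketch; the
function names are made up for illustration and are not part of HDF5.

```c
#include <stdio.h>

/* Average B-tree bytes per chunked dataset (integer division).
 * Returns 0 if there are no chunked datasets. */
unsigned long long btree_bytes_per_dataset(unsigned long long btree_bytes,
                                           unsigned long long n_chunked)
{
    return n_chunked ? btree_bytes / n_chunked : 0ULL;
}

/* Ratio of chunk B-tree size to raw data size.
 * Returns 0.0 if the raw data size is 0. */
double btree_to_raw_ratio(unsigned long long btree_bytes,
                          unsigned long long raw_bytes)
{
    return raw_bytes ? (double)btree_bytes / (double)raw_bytes : 0.0;
}

/* Print both figures for one set of h5stat numbers. */
void report_btree_overhead(unsigned long long btree_bytes,
                           unsigned long long n_chunked,
                           unsigned long long raw_bytes)
{
    printf("B-tree bytes per chunked dataset: %llu\n",
           btree_bytes_per_dataset(btree_bytes, n_chunked));
    printf("B-tree size / raw data size: %.2f\n",
           btree_to_raw_ratio(btree_bytes, raw_bytes));
}
```

Calling report_btree_overhead(115116512ULL, 54922ULL, 74331886ULL) with the
numbers from the quoted file gives 2096 bytes per chunked dataset, with the
B-tree index taking roughly one and a half times the space of the raw data.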
> Is there a way to reduce the size?
You should be able to use the H5Pset_istore_k() API routine
(http://www.hdfgroup.org/HDF5/doc/RM/RM_H5P.html#Property-SetIstoreK) to reduce
the B-tree fanout value.
Quincey
> Cheers,
> Andy
>
> ================================================================
>
> File information
>     # of unique groups: 31107
>     # of unique datasets: 58767
>     # of unique named datatypes: 0
>     # of unique links: 0
>     # of unique other: 0
>     Max. # of links to object: 1
>     Max. # of objects in group: 201
> Object header size: (total/unused)
>     Groups: 1311904/0
>     Datasets: 27052128/16288
>     Datatypes: 0/0
> Storage information:
>     Groups:
>         B-tree/List: 28798864
>         Heap: 4964928
>     Attributes:
>         B-tree/List: 0
>         Heap: 0
>     Chunked datasets:
>         B-tree: 115116512
>     Shared Messages:
>         Header: 0
>         B-tree/List: 0
>         Heap: 0
>     Superblock extension: 0
> Small groups:
>     # of groups of size 1: 1222
>     # of groups of size 2: 27663
>     # of groups of size 3: 1211
>     # of groups of size 4: 203
>     # of groups of size 5: 403
>     Total # of small groups: 30702
> Group bins:
>     # of groups of size 1 - 9: 30702
>     # of groups of size 10 - 99: 202
>     # of groups of size 100 - 999: 203
>     Total # of groups: 31107
> Dataset dimension information:
>     Max. rank of datasets: 1
>     Dataset ranks:
>         # of dataset with rank 0: 2230
>         # of dataset with rank 1: 56537
> 1-D Dataset information:
>     Max. dimension size of 1-D datasets: 600
>     Small 1-D datasets:
>         # of dataset dimensions of size 1: 1461
>         # of dataset dimensions of size 4: 202
>         # of dataset dimensions of size 9: 202
>         Total small datasets: 1865
>     1-D Dataset dimension bins:
>         # of datasets of size 1 - 9: 1865
>         # of datasets of size 10 - 99: 49044
>         # of datasets of size 100 - 999: 5628
>         Total # of datasets: 56537
> Dataset storage information:
>     Total raw data size: 74331886
> Dataset layout information:
>     Dataset layout counts[COMPACT]: 0
>     Dataset layout counts[CONTIG]: 3845
>     Dataset layout counts[CHUNKED]: 54922
>     Number of external files : 0
> Dataset filters information:
>     Number of datasets with:
>         NO filter: 3845
>         GZIP filter: 54922
>         SHUFFLE filter: 0
>         FLETCHER32 filter: 0
>         SZIP filter: 0
>         NBIT filter: 0
>         SCALEOFFSET filter: 0
>         USER-DEFINED filter: 0
>
>
>
> _______________________________________________
> Hdf-forum is for HDF software users discussion.
> [email protected]
> http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org