Re: [Hdf-forum] B-Tree size for chunked dataset

Quincey Koziol Tue, 31 Aug 2010 04:46:45 -0700

Hi Andy,

On Aug 30, 2010, at 9:06 PM, Salnikov, Andrei A. wrote:


> Hi HDF5 gurus,
> 
> We are seeing some unexpected behavior creating HDF5 files
> which we need to understand. Our data sets have wildly varying 
> sizes and complexity. In one extreme case we need to write 
> many (thousands) relatively small datasets in one HDF5 file. 
> Because the size of the datasets and their number is not known 
> in advance we prefer to use chunked storage for the data.
> What we see is that when the number of chunked datasets is large
> then the size of the HDF5 file becomes much larger than the 
> volume of the data stored. It seems that there is a significant 
> overhead per dataset for chunked data. 
> 
> I tried to understand where this overhead comes from. Running 
> h5stat on several files I see that the size of the B-Tree for
> chunked datasets in our case takes more space than even data 
> itself. I collected the dependency of the B-Tree size on the 
> number of the chunked datasets in a file. It looks like the size 
> of the tree grows linearly with the number of the chunked datasets
> (at least in our case when there is just one chunk per dataset).
> It seems that there is 2096 bytes of B-Tree space allocated per 
> every chunked dataset. With large number of datasets (tens of 
> thousands) the overhead becomes very big in our case. Below is 
> an example of h5stat output for one of our problematic files 
> if anybody is interested to look at it.
> 
> Is my analysis of B-Tree size growth correct?

        Yes, I think your analysis is correct.  Currently, there's at least one 
B-tree node per chunked dataset (that will be changing with the 1.10.0 release, 
when it's finished).
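
        As a quick sanity check of the 2096-byte figure, the totals from the
h5stat output quoted at the end of this message can be divided out directly.
The helper names below are made up for illustration; only the input numbers
come from the h5stat output.

```c
#include <stdint.h>

/* Average chunk B-tree bytes per chunked dataset, from h5stat totals.
 * (Hypothetical helper, not part of any HDF5 API.) */
static uint64_t btree_bytes_per_dataset(uint64_t btree_total_bytes,
                                        uint64_t n_chunked_datasets)
{
    return btree_total_bytes / n_chunked_datasets;
}

/* Chunk B-tree size as a whole-number percentage of the raw data size
 * (integer division, so the result is rounded down). */
static uint64_t btree_percent_of_raw(uint64_t btree_total_bytes,
                                     uint64_t raw_data_bytes)
{
    return (btree_total_bytes * 100u) / raw_data_bytes;
}
```

        With the quoted figures (115116512 bytes of chunk B-tree, 54922
chunked datasets, 74331886 bytes of raw data), the first function gives
exactly 2096 bytes per dataset with no remainder, and the second gives 154,
i.e. the B-tree storage is roughly one and a half times the raw data.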

> Is there a way to reduce the size?

        You should be able to use the H5Pset_istore_k() API routine 
(http://www.hdfgroup.org/HDF5/doc/RM/RM_H5P.html#Property-SetIstoreK) to reduce 
the B-tree fanout value.
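
        To get a feel for how much the fanout value matters, here is a rough
sketch of the on-disk size of one chunk B-tree node as a function of the
"ik" value passed to H5Pset_istore_k(). It assumes the version 1 B-tree node
layout as I read it in the file format specification (a 24-byte node header,
2*ik child addresses, 2*ik+1 keys), 8-byte file addresses, and a chunk key
made of a 4-byte chunk size, a 4-byte filter mask and rank+1 8-byte offsets.
The function names are invented for this sketch, and the real library may
differ in details.

```c
#include <stddef.h>

/* Assumption: 8-byte file addresses ("size of offsets" in the superblock). */
enum { SIZEOF_OFFSETS = 8 };

/* Assumed chunk key layout: chunk size (4) + filter mask (4)
 * + one 8-byte offset per dimension, plus one extra for the datatype. */
static size_t chunk_key_size(unsigned rank)
{
    return 4 + 4 + 8 * ((size_t)rank + 1);
}

/* Estimated size in bytes of one chunk B-tree node for fanout value ik. */
static size_t chunk_btree_node_size(unsigned ik, unsigned rank)
{
    /* signature (4) + node type (1) + level (1) + entries used (2)
     * + left and right sibling addresses */
    size_t header   = 4 + 1 + 1 + 2 + 2 * (size_t)SIZEOF_OFFSETS;
    size_t children = 2 * (size_t)ik * SIZEOF_OFFSETS;
    size_t keys     = (2 * (size_t)ik + 1) * chunk_key_size(rank);
    return header + children + keys;
}
```

        Under these assumptions, chunk_btree_node_size(32, 1) comes out to
2096 bytes, which matches the per-dataset overhead you measured for 1-D
datasets with the default ik of 32. The same estimate gives 304 bytes for
ik = 4 and 112 bytes for ik = 1. The value is set on the file creation
property list before the file is created. A smaller value means deeper
B-trees for any dataset that has many chunks, so it is a trade-off.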

        Quincey
 
> Cheers,
> Andy
> 
> ================================================================
> 
> File information
>        # of unique groups: 31107
>        # of unique datasets: 58767
>        # of unique named datatypes: 0
>        # of unique links: 0
>        # of unique other: 0
>        Max. # of links to object: 1
>        Max. # of objects in group: 201
> Object header size: (total/unused)
>        Groups: 1311904/0
>        Datasets: 27052128/16288
>        Datatypes: 0/0
> Storage information:
>        Groups:
>                B-tree/List: 28798864
>                Heap: 4964928
>        Attributes:
>                B-tree/List: 0
>                Heap: 0
>        Chunked datasets:
>                B-tree: 115116512
>        Shared Messages:
>                Header: 0
>                B-tree/List: 0
>                Heap: 0
>        Superblock extension: 0
> Small groups:
>        # of groups of size 1: 1222
>        # of groups of size 2: 27663
>        # of groups of size 3: 1211
>        # of groups of size 4: 203
>        # of groups of size 5: 403
>        Total # of small groups: 30702
> Group bins:
>        # of groups of size 1 - 9: 30702
>        # of groups of size 10 - 99: 202
>        # of groups of size 100 - 999: 203
>        Total # of groups: 31107
> Dataset dimension information:
>        Max. rank of datasets: 1
>        Dataset ranks:
>                # of dataset with rank 0: 2230
>                # of dataset with rank 1: 56537
> 1-D Dataset information:
>        Max. dimension size of 1-D datasets: 600
>        Small 1-D datasets:
>                # of dataset dimensions of size 1: 1461
>                # of dataset dimensions of size 4: 202
>                # of dataset dimensions of size 9: 202
>                Total small datasets: 1865
>        1-D Dataset dimension bins:
>                # of datasets of size 1 - 9: 1865
>                # of datasets of size 10 - 99: 49044
>                # of datasets of size 100 - 999: 5628
>                Total # of datasets: 56537
> Dataset storage information:
>        Total raw data size: 74331886
> Dataset layout information:
>        Dataset layout counts[COMPACT]: 0
>        Dataset layout counts[CONTIG]: 3845
>        Dataset layout counts[CHUNKED]: 54922
>        Number of external files : 0
> Dataset filters information:
>        Number of datasets with:
>                NO filter: 3845
>                GZIP filter: 54922
>                SHUFFLE filter: 0
>                FLETCHER32 filter: 0
>                SZIP filter: 0
>                NBIT filter: 0
>                SCALEOFFSET filter: 0
>                USER-DEFINED filter: 0
> 
> 
> 
> _______________________________________________
> Hdf-forum is for HDF software users discussion.
> [email protected]
> http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org

