Hi HDF5 gurus,

We are seeing some unexpected behavior when creating HDF5 files
which we need to understand. Our data sets vary wildly in size
and complexity. In one extreme case we need to write many
(thousands of) relatively small datasets into one HDF5 file.
Because neither the size of the datasets nor their number is known
in advance, we prefer to use chunked storage for the data.
What we see is that when the number of chunked datasets is large,
the size of the HDF5 file becomes much larger than the volume of
the data stored. There appears to be a significant per-dataset
overhead for chunked data.

I tried to understand where this overhead comes from. Running
h5stat on several files, I see that in our case the B-Tree for
chunked datasets takes more space than the data itself. I
measured how the B-Tree size depends on the number of chunked
datasets in a file. The size of the tree appears to grow linearly
with the number of chunked datasets (at least in our case, where
there is just one chunk per dataset): about 2096 bytes of B-Tree
space are allocated for every chunked dataset. With a large number
of datasets (tens of thousands) this overhead becomes very big for
us. Below is an example of h5stat output for one of our
problematic files, in case anybody is interested in looking at it.
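The h5stat numbers below are consistent with that per-dataset figure; here is a quick sanity check of the arithmetic (the constants are copied from the output below):

```python
# Figures taken from the h5stat output below.
chunked_btree_bytes = 115116512   # "Chunked datasets: B-tree"
chunked_datasets = 54922          # "Dataset layout counts[CHUNKED]"
raw_data_bytes = 74331886         # "Total raw data size"

# B-Tree space per chunked dataset -- comes out to exactly 2096 bytes.
per_dataset = chunked_btree_bytes / chunked_datasets
print(per_dataset)                # 2096.0

# The chunk index alone is roughly 1.5x the size of the raw data.
print(chunked_btree_bytes / raw_data_bytes)
```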

Is my analysis of B-Tree size growth correct? Is there a way to 
reduce the size?
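For what it's worth, 2096 bytes is exactly the size of one version-1 B-tree node with the default settings, if I read the HDF5 file format spec correctly: a node holds a 24-byte header, 2K child pointers of 8 bytes each, and 2K+1 keys, where a chunk key is 8 bytes (chunk size plus filter mask) plus 8 bytes per dimension, chunked datasets store rank + 1 dimensions, and the default K for the chunk B-tree (istore_k) is 32. A sketch of that calculation, and of what a smaller K (settable via H5Pset_istore_k on the file creation property list) might save -- this is my reading of the spec, not something I have verified against the library:

```python
def v1_btree_node_size(k, rank, addr_size=8):
    """Approximate size of one HDF5 version-1 chunk B-tree node.

    Based on my reading of the HDF5 file format spec: a 24-byte
    node header, 2K child pointers of addr_size bytes, and 2K+1
    keys, where each chunk key is 8 bytes (chunk size + filter
    mask) plus 8 bytes per dimension, and chunked datasets store
    rank + 1 dimensions.
    """
    key_size = 8 + 8 * (rank + 1)
    return 24 + 2 * k * addr_size + (2 * k + 1) * key_size

# Default istore_k is 32; our datasets are rank 1 with one chunk
# each, so one node per dataset of exactly 2096 bytes.
print(v1_btree_node_size(32, 1))   # 2096

# A much smaller istore_k, e.g. 1, would shrink the node a lot.
print(v1_btree_node_size(1, 1))    # 112
```

If that reading is right, lowering istore_k at file creation time would cut the per-dataset cost substantially for single-chunk datasets, at the price of deeper trees for datasets with many chunks.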

Cheers,
Andy

================================================================

File information
        # of unique groups: 31107
        # of unique datasets: 58767
        # of unique named datatypes: 0
        # of unique links: 0
        # of unique other: 0
        Max. # of links to object: 1
        Max. # of objects in group: 201
Object header size: (total/unused)
        Groups: 1311904/0
        Datasets: 27052128/16288
        Datatypes: 0/0
Storage information:
        Groups:
                B-tree/List: 28798864
                Heap: 4964928
        Attributes:
                B-tree/List: 0
                Heap: 0
        Chunked datasets:
                B-tree: 115116512
        Shared Messages:
                Header: 0
                B-tree/List: 0
                Heap: 0
        Superblock extension: 0
Small groups:
        # of groups of size 1: 1222
        # of groups of size 2: 27663
        # of groups of size 3: 1211
        # of groups of size 4: 203
        # of groups of size 5: 403
        Total # of small groups: 30702
Group bins:
        # of groups of size 1 - 9: 30702
        # of groups of size 10 - 99: 202
        # of groups of size 100 - 999: 203
        Total # of groups: 31107
Dataset dimension information:
        Max. rank of datasets: 1
        Dataset ranks:
                # of dataset with rank 0: 2230
                # of dataset with rank 1: 56537
1-D Dataset information:
        Max. dimension size of 1-D datasets: 600
        Small 1-D datasets:
                # of dataset dimensions of size 1: 1461
                # of dataset dimensions of size 4: 202
                # of dataset dimensions of size 9: 202
                Total small datasets: 1865
        1-D Dataset dimension bins:
                # of datasets of size 1 - 9: 1865
                # of datasets of size 10 - 99: 49044
                # of datasets of size 100 - 999: 5628
                Total # of datasets: 56537
Dataset storage information:
        Total raw data size: 74331886
Dataset layout information:
        Dataset layout counts[COMPACT]: 0
        Dataset layout counts[CONTIG]: 3845
        Dataset layout counts[CHUNKED]: 54922
        Number of external files : 0
Dataset filters information:
        Number of datasets with:
                NO filter: 3845
                GZIP filter: 54922
                SHUFFLE filter: 0
                FLETCHER32 filter: 0
                SZIP filter: 0
                NBIT filter: 0
                SCALEOFFSET filter: 0
                USER-DEFINED filter: 0



_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org
