Hi Andy,
On Aug 31, 2010, at 10:04 AM, Salnikov, Andrei A. wrote:
> Hi Quincey,
>
> thanks for the info. I could not find the default value for
> H5Pset_istore_k; is it documented anywhere?
The default value is 32 (which means the fanout is 64). (You can also
call H5Pget_istore_k() on a newly created file creation property list to
query it.)
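For example, here's a quick sketch (untested, names are just placeholders)
that queries the default from a fresh file creation property list:

    #include <stdio.h>
    #include "hdf5.h"

    int main(void)
    {
        /* A freshly created file creation property list carries the
         * library default for istore_k. */
        hid_t    fcpl = H5Pcreate(H5P_FILE_CREATE);
        unsigned ik   = 0;

        if (H5Pget_istore_k(fcpl, &ik) >= 0)
            printf("default istore_k = %u (B-tree fanout = %u)\n", ik, 2 * ik);

        H5Pclose(fcpl);
        return 0;
    }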
> What would be the recommended value for our case with many small datasets?
You could probably turn it all the way down to 2-4 without any
problems, if all the datasets have very few chunks.
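Something along these lines should do it (an untested sketch; the file name
and the value 4 are just examples). Since istore_k is a file creation
property, it has to be set before the file is created:

    #include "hdf5.h"

    int main(void)
    {
        hid_t fcpl = H5Pcreate(H5P_FILE_CREATE);

        /* Lower the indexed-storage B-tree "half rank"; 4 gives a fanout
         * of 8 instead of the default 64, so each per-dataset B-tree node
         * should be much smaller. */
        H5Pset_istore_k(fcpl, 4);

        hid_t file = H5Fcreate("many_small_dsets.h5", H5F_ACC_TRUNC,
                               fcpl, H5P_DEFAULT);

        /* ... create the many small chunked datasets as before ... */

        H5Pclose(fcpl);
        H5Fclose(file);
        return 0;
    }

Note that this only affects files created with that property list; it
won't shrink the B-trees in files you've already written.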
Quincey
> Cheers,
> Andy
>
> Quincey Koziol wrote on 2010-08-31:
>> Hi Andy,
>>
>> On Aug 30, 2010, at 9:06 PM, Salnikov, Andrei A. wrote:
>>
>>> Hi HDF5 gurus,
>>>
>>> We are seeing some unexpected behavior creating HDF5 files
>>> which we need to understand. Our data sets have wildly varying
>>> sizes and complexity. In one extreme case we need to write
>>> many (thousands of) relatively small datasets in one HDF5 file.
>>> Because the size of the datasets and their number are not known
>>> in advance, we prefer to use chunked storage for the data.
>>> What we see is that when the number of chunked datasets is large,
>>> the size of the HDF5 file becomes much larger than the volume of
>>> the data stored. It seems that there is a significant overhead
>>> per dataset for chunked data.
>>>
>>> I tried to understand where this overhead comes from. Running
>>> h5stat on several files I see that the size of the B-Tree for
>>> chunked datasets in our case takes more space than the data
>>> itself. I collected the dependency of the B-Tree size on the
>>> number of chunked datasets in a file. It looks like the size
>>> of the tree grows linearly with the number of chunked datasets
>>> (at least in our case, where there is just one chunk per dataset).
>>> It seems that 2096 bytes of B-Tree space are allocated for every
>>> chunked dataset (in the file below: 115116512 bytes of chunked
>>> dataset B-Tree for 54922 chunked datasets, i.e. exactly 2096 bytes
>>> each). With a large number of datasets (tens of thousands) this
>>> overhead becomes very significant for us. Below is an example of
>>> h5stat output for one of our problematic files, in case anybody
>>> is interested in looking at it.
>>>
>>> Is my analysis of B-Tree size growth correct?
>>
>> Yes, I think your analysis is correct. Currently, there's at least
>> one B-tree node per chunked dataset (that will be changing with the 1.10.0
>> release, when it's finished).
>>
>>> Is there a way to reduce the size?
>>
>> You should be able to use the H5Pset_istore_k() API routine
>> (http://www.hdfgroup.org/HDF5/doc/RM/RM_H5P.html#Property-SetIstoreK) to
>> reduce the B-tree fanout value.
>>
>> Quincey
>>
>>> Cheers,
>>> Andy
>>>
>>> ================================================================
>>>
>>> File information
>>>     # of unique groups: 31107
>>>     # of unique datasets: 58767
>>>     # of unique named datatypes: 0
>>>     # of unique links: 0
>>>     # of unique other: 0
>>>     Max. # of links to object: 1
>>>     Max. # of objects in group: 201
>>> Object header size: (total/unused)
>>>     Groups: 1311904/0
>>>     Datasets: 27052128/16288
>>>     Datatypes: 0/0
>>> Storage information:
>>>     Groups:
>>>         B-tree/List: 28798864
>>>         Heap: 4964928
>>>     Attributes:
>>>         B-tree/List: 0
>>>         Heap: 0
>>>     Chunked datasets:
>>>         B-tree: 115116512
>>>     Shared Messages:
>>>         Header: 0
>>>         B-tree/List: 0
>>>         Heap: 0
>>>     Superblock extension: 0
>>> Small groups:
>>>     # of groups of size 1: 1222
>>>     # of groups of size 2: 27663
>>>     # of groups of size 3: 1211
>>>     # of groups of size 4: 203
>>>     # of groups of size 5: 403
>>>     Total # of small groups: 30702
>>> Group bins:
>>>     # of groups of size 1 - 9: 30702
>>>     # of groups of size 10 - 99: 202
>>>     # of groups of size 100 - 999: 203
>>>     Total # of groups: 31107
>>> Dataset dimension information:
>>>     Max. rank of datasets: 1
>>>     Dataset ranks:
>>>         # of dataset with rank 0: 2230
>>>         # of dataset with rank 1: 56537
>>> 1-D Dataset information:
>>>     Max. dimension size of 1-D datasets: 600
>>>     Small 1-D datasets:
>>>         # of dataset dimensions of size 1: 1461
>>>         # of dataset dimensions of size 4: 202
>>>         # of dataset dimensions of size 9: 202
>>>         Total small datasets: 1865
>>>     1-D Dataset dimension bins:
>>>         # of datasets of size 1 - 9: 1865
>>>         # of datasets of size 10 - 99: 49044
>>>         # of datasets of size 100 - 999: 5628
>>>         Total # of datasets: 56537
>>> Dataset storage information:
>>>     Total raw data size: 74331886
>>> Dataset layout information:
>>>     Dataset layout counts[COMPACT]: 0
>>>     Dataset layout counts[CONTIG]: 3845
>>>     Dataset layout counts[CHUNKED]: 54922
>>>     Number of external files : 0
>>> Dataset filters information:
>>>     Number of datasets with:
>>>         NO filter: 3845
>>>         GZIP filter: 54922
>>>         SHUFFLE filter: 0
>>>         FLETCHER32 filter: 0
>>>         SZIP filter: 0
>>>         NBIT filter: 0
>>>         SCALEOFFSET filter: 0
>>>         USER-DEFINED filter: 0
>>>
>>>
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org