Hi Andy,
On Aug 31, 2010, at 10:04 AM, Salnikov, Andrei A. wrote:
> Hi Quincey,
>
> thanks for the info. I could not find the default value for
> H5Pset_istore_k; is it documented anywhere?
The default value is 32 (which means the fanout is 64). (You can also
call H5Pget_istore_k() on a newly created file creation property list to
query it.)
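For example, here's a quick sketch (untested, names are just placeholders)
that queries the default from a fresh file creation property list:

    #include <stdio.h>
    #include "hdf5.h"

    int main(void)
    {
        /* A freshly created file creation property list carries the
         * library default for istore_k. */
        hid_t    fcpl = H5Pcreate(H5P_FILE_CREATE);
        unsigned ik   = 0;

        if (H5Pget_istore_k(fcpl, &ik) >= 0)
            printf("default istore_k = %u (B-tree fanout = %u)\n", ik, 2 * ik);

        H5Pclose(fcpl);
        return 0;
    }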
> What would be the recommended value for our case with many small datasets?
You could probably turn it all the way down to 2-4 without any
problems, if all the datasets have very few chunks.
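Something along these lines should do it (an untested sketch; the file name
and the value 4 are just examples). Since istore_k is a file creation
property, it has to be set before the file is created:

    #include "hdf5.h"

    int main(void)
    {
        hid_t fcpl = H5Pcreate(H5P_FILE_CREATE);

        /* Lower the indexed-storage B-tree "half rank"; 4 gives a fanout
         * of 8 instead of the default 64, so each per-dataset B-tree node
         * should be much smaller. */
        H5Pset_istore_k(fcpl, 4);

        hid_t file = H5Fcreate("many_small_dsets.h5", H5F_ACC_TRUNC,
                               fcpl, H5P_DEFAULT);

        /* ... create the many small chunked datasets as before ... */

        H5Pclose(fcpl);
        H5Fclose(file);
        return 0;
    }

Note that this only affects files created with that property list; it
won't shrink the B-trees in files you've already written.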
Quincey
> Cheers,
> Andy
>
> Quincey Koziol wrote on 2010-08-31:
>> Hi Andy,
>>
>> On Aug 30, 2010, at 9:06 PM, Salnikov, Andrei A. wrote:
>>
>>> Hi HDF5 gurus,
>>>
>>> We are seeing some unexpected behavior creating HDF5 files
>>> which we need to understand. Our data sets have wildly varying
>>> sizes and complexity. In one extreme case we need to write
>>> many (thousands of) relatively small datasets in one HDF5 file.
>>> Because the size of the datasets and their number are not known
>>> in advance, we prefer to use chunked storage for the data.
>>> What we see is that when the number of chunked datasets is large,
>>> the size of the HDF5 file becomes much larger than the volume of
>>> the data stored. It seems that there is a significant overhead
>>> per dataset for chunked data.
>>>
>>> I tried to understand where this overhead comes from. Running
>>> h5stat on several files I see that the size of the B-Tree for
>>> chunked datasets in our case takes more space than the data
>>> itself. I collected the dependency of the B-Tree size on the
>>> number of chunked datasets in a file. It looks like the size
>>> of the tree grows linearly with the number of chunked datasets
>>> (at least in our case, where there is just one chunk per dataset).
>>> It seems that 2096 bytes of B-Tree space are allocated for every
>>> chunked dataset (in the file below: 115116512 bytes of chunked
>>> dataset B-Tree for 54922 chunked datasets, i.e. exactly 2096 bytes
>>> each). With a large number of datasets (tens of thousands) this
>>> overhead becomes very significant for us. Below is an example of
>>> h5stat output for one of our problematic files, in case anybody
>>> is interested in looking at it.
>>>
>>> Is my analysis of B-Tree size growth correct?
>>
>> Yes, I think your analysis is correct. Currently, there's at least
>> one B-tree node per chunked dataset (that will be changing with the 1.10.0
>> release, when it's finished).
>>
>>> Is there a way to reduce the size?
>>
>> You should be able to use the H5Pset_istore_k() API routine
>> (http://www.hdfgroup.org/HDF5/doc/RM/RM_H5P.html#Property-SetIstoreK) to
>> reduce the B-tree fanout value.
>>
>> Quincey
>>
>>> Cheers,
>>> Andy
>>>
>>> ================================================================
>>>
>>> File information
>>>     # of unique groups: 31107
>>>     # of unique datasets: 58767
>>>     # of unique named datatypes: 0
>>>     # of unique links: 0
>>>     # of unique other: 0
>>>     Max. # of links to object: 1
>>>     Max. # of objects in group: 201
>>> Object header size: (total/unused)
>>>     Groups: 1311904/0
>>>     Datasets: 27052128/16288
>>>     Datatypes: 0/0
>>> Storage information:
>>>     Groups:
>>>         B-tree/List: 28798864
>>>         Heap: 4964928
>>>     Attributes:
>>>         B-tree/List: 0
>>>         Heap: 0
>>>     Chunked datasets:
>>>         B-tree: 115116512
>>>     Shared Messages:
>>>         Header: 0
>>>         B-tree/List: 0
>>>         Heap: 0
>>>     Superblock extension: 0
>>> Small groups:
>>>     # of groups of size 1: 1222
>>>     # of groups of size 2: 27663
>>>     # of groups of size 3: 1211
>>>     # of groups of size 4: 203
>>>     # of groups of size 5: 403
>>>     Total # of small groups: 30702
>>> Group bins:
>>>     # of groups of size 1 - 9: 30702
>>>     # of groups of size 10 - 99: 202
>>>     # of groups of size 100 - 999: 203
>>>     Total # of groups: 31107
>>> Dataset dimension information:
>>>     Max. rank of datasets: 1
>>>     Dataset ranks:
>>>         # of dataset with rank 0: 2230
>>>         # of dataset with rank 1: 56537
>>> 1-D Dataset information:
>>>     Max. dimension size of 1-D datasets: 600
>>>     Small 1-D datasets:
>>>         # of dataset dimensions of size 1: 1461
>>>         # of dataset dimensions of size 4: 202
>>>         # of dataset dimensions of size 9: 202
>>>         Total small datasets: 1865
>>>     1-D Dataset dimension bins:
>>>         # of datasets of size 1 - 9: 1865
>>>         # of datasets of size 10 - 99: 49044
>>>         # of datasets of size 100 - 999: 5628
>>>         Total # of datasets: 56537
>>> Dataset storage information:
>>>     Total raw data size: 74331886
>>> Dataset layout information:
>>>     Dataset layout counts[COMPACT]: 0
>>>     Dataset layout counts[CONTIG]: 3845
>>>     Dataset layout counts[CHUNKED]: 54922
>>>     Number of external files : 0
>>> Dataset filters information:
>>>     Number of datasets with:
>>>         NO filter: 3845
>>>         GZIP filter: 54922
>>>         SHUFFLE filter: 0
>>>         FLETCHER32 filter: 0
>>>         SZIP filter: 0
>>>         NBIT filter: 0
>>>         SCALEOFFSET filter: 0
>>>         USER-DEFINED filter: 0
>>>
>>>
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org