Hi Andy,
On Mar 9, 2011, at 11:15 AM, Salnikov, Andrei A. wrote:
>
> Quincey Koziol wrote on 2011-03-09:
>> Hi Andy,
>>
>> On Mar 8, 2011, at 7:09 PM, Salnikov, Andrei A. wrote:
>>
>>> Hi,
>>>
>>> I'm trying to understand a performance hit that we are
>>> experiencing while examining the tree structure of
>>> our HDF5 files. We originally observed the problem when
>>> using h5py, but it can be reproduced even with the h5ls
>>> command. I tracked it down to a significant delay in
>>> the call to the H5Oget_info_by_name function on a dataset
>>> with a large number of chunks. It looks like when the
>>> number of chunks in a dataset increases (in our case
>>> we have 1-10k chunks), the performance of H5Oget_info
>>> drops significantly. Looking at the I/O statistics, it
>>> seems that the HDF5 library performs very many small I/O
>>> operations in this case. Very little CPU time is spent,
>>> but the real time is measured in tens of seconds.
>>>
>>> Is this expected behavior? Can it be improved somehow
>>> without drastically reducing the number of chunks?
>>>
>>> One more comment about H5Oget_info - it returns a
>>> structure that contains a lot of different info.
>>> In the h5py code, the only member of the structure
>>> that is actually used is "type". Could there be a
>>> more efficient way to determine just the type of the
>>> object without retrieving every other piece of info?
>>
>> Ah, yes, we've noticed that in some of the applications we've worked
>> with as well (including some of the main HDF5 tools, like h5ls, etc.). As you
>> say, H5Oget_info() is fairly heavyweight, gathering all sorts of information
>> about each object. I do think a lighter-weight call like "H5Oget_type"
>> would be useful. Is there any other "lightweight" information that people
>> would like returned for each object?
>>
>> Quincey
>>
>
> Hi Quincey,
>
> Thanks for confirming this. Could you explain briefly what is
> going on there and which part of H5O_info_t requires so many reads?
The H5Oget_info() call gathers information about the amount of space
that the dataset's metadata is using. When there's a large B-tree indexing
the chunks, walking that B-tree takes a fair bit of time, and each node visited
is another small read, which accounts for the I/O pattern you're seeing.
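For concreteness, here is a minimal sketch that isolates the call in
question, assuming the 1.8 C API; the file name "data.h5" and dataset
name "ds" are just placeholders for a file with a heavily chunked dataset:

#include <stdio.h>
#include <sys/time.h>
#include "hdf5.h"

/* Wall-clock time in seconds */
static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
    hid_t      file;
    H5O_info_t oinfo;
    double     t0, t1;

    file = H5Fopen("data.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    if (file < 0)
        return 1;

    t0 = now();
    /* Fills every field of H5O_info_t, including meta_size, which
     * requires walking the dataset's chunk-index B-tree. */
    if (H5Oget_info_by_name(file, "ds", &oinfo, H5P_DEFAULT) < 0)
        return 1;
    t1 = now();

    printf("type = %d, chunk index = %llu bytes, wall time = %.2f s\n",
           (int)oinfo.type,
           (unsigned long long)oinfo.meta_size.obj.index_size,
           t1 - t0);

    H5Fclose(file);
    return 0;
}

The meta_size.obj.index_size value reported there is the size of the
chunk B-tree, which is what has to be walked in order to fill it in.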
> Maybe removing the heavyweight info from H5O_info_t is the right
> thing to do, or creating another version of the H5O_info_t structure
> which has only light-weight info?
I'm leaning toward another light-weight version. I'm asking the HDF5
community to help me decide what goes into that structure besides the object
type.
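In the meantime, if the object type is really all that's needed, one
possible workaround is to open the object and ask for the type of the
returned identifier, instead of requesting the full H5O_info_t. A sketch
using only the standard 1.8 API follows; whether it actually avoids the
chunk B-tree walk in your case should be verified by profiling, and
"data.h5" / "ds" are again placeholder names:

#include <stdio.h>
#include "hdf5.h"

/* Determine only the object type, without the full H5O_info_t. */
static H5O_type_t object_type(hid_t loc, const char *name)
{
    hid_t      obj  = H5Oopen(loc, name, H5P_DEFAULT);
    H5O_type_t type = H5O_TYPE_UNKNOWN;

    if (obj >= 0) {
        switch (H5Iget_type(obj)) {
            case H5I_GROUP:    type = H5O_TYPE_GROUP;          break;
            case H5I_DATASET:  type = H5O_TYPE_DATASET;        break;
            case H5I_DATATYPE: type = H5O_TYPE_NAMED_DATATYPE; break;
            default:           break;
        }
        H5Oclose(obj);
    }
    return type;
}

int main(void)
{
    hid_t file = H5Fopen("data.h5", H5F_ACC_RDONLY, H5P_DEFAULT);

    if (file < 0)
        return 1;
    printf("object type = %d\n", (int)object_type(file, "ds"));
    H5Fclose(file);
    return 0;
}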
Quincey