Neil Fortner wrote on 2011-03-14:
> Andy,
> 
> On 03/11/2011 06:48 PM, Salnikov, Andrei A. wrote:
>> Quincey Koziol wrote on 2011-03-10:
>>> Hi Andy,
>>> 
>>> On Mar 9, 2011, at 11:15 AM, Salnikov, Andrei A. wrote:
>>> 
>>>> Quincey Koziol wrote on 2011-03-09:
>>>>> Hi Andy,
>>>>> 
>>>>> On Mar 8, 2011, at 7:09 PM, Salnikov, Andrei A. wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> I'm trying to understand a performance hit that we are
>>>>>> experiencing trying to examine the tree structure of
>>>>>> our HDF5 files. Originally we observed the problem when
>>>>>> using h5py but it could be reproduced even with h5ls
>>>>>> command. I tracked it down to a significant delay in
>>>>>> the call to H5Oget_info_by_name function on a dataset
>>>>>> with a large number of chunks. It looks like when the
>>>>>> number of chunks in dataset increases (in our case
>>>>>> we have 1-10k chunks) the performance of the H5Oget_info
>>>>>> drops significantly. Looking at the IO statistics it
>>>>>> seems that HDF5 library does very many small IO operations
>>>>>> in this case. There is very little CPU spent, but real
>>>>>> time is measured in tens of seconds.
>>>>>> 
>>>>>> Is this an expected behavior? Can it be improved somehow
>>>>>> without reducing the number of chunks drastically?
>>>>>> 
>>>>>> One more comment about H5Oget_info - it returns a
>>>>>> structure that contains a lot of different info.
>>>>>> In the case of h5py code the only member of the
>>>>>> structure used in the code is "type". Could there be a
>>>>>> more efficient way to determine just the type of the
>>>>>> object without requiring every other piece of info?
>>>>>   Ah, yes, we've noticed that in some of the applications we've
>>>>> worked with also (including some of the main HDF5 tools, like h5ls,
>>>>> etc). As you say, H5Oget_info() is fairly heavyweight, getting all
>>>>> sorts of information about each object.  I do think a lighter-weight
>>>>> call like "H5Oget_type" would be useful.  Is there other
>>>>> "lightweight" information that people would like back for each
>>>>> object?
>>>>> 
>>>>>   Quincey
>>>>> 
>>>> Hi Quincey,
>>>> 
>>>> thanks for confirming this. Could you explain briefly what is
>>>> going on there and which part of H5O_info_t needs so many reads?
>>>     The H5Oget_info() call is gathering information about the amount of
>>> space that the metadata for the dataset is using.  When there's a large
>>> B-tree for indexing the chunks, that can take a fair bit of time to
>>> walk the B-tree.
>>> 
>>>>   Maybe removing heavyweight info from H5O_info_t is the right
>>>> thing to do, or creating another version of H5O_info_t structure
>>>> which has only light-weight info?
>>>     I'm leaning toward another light-weight version.  I'm asking the HDF5
>>> community to help me decide what goes into that structure besides the
>>> object type.
>>> 
>> Hi Quincey,
>> 
>> is there a chance we can get this new version in the next release?
> 
> We actually already have an experimental branch with a similar feature
> mostly implemented.  It allows you to specify the fields you want filled
> in by H5Oget_info.  The branch can be found at:
> 
> http://svn.hdfgroup.uiuc.edu/hdf5/branches/h5oget_info_by_field/
> 
> The new functions are:
> 
> herr_t H5Oget_info2(hid_t loc_id, H5O_info_t *oinfo, unsigned fields);
> herr_t H5Oget_info_by_name2(hid_t loc_id, const char *name,
>     H5O_info_t *oinfo, unsigned fields, hid_t lapl_id);
> 
> The "fields" parameter can contain the following bitflags (combined with
> "|"):
> 
> H5O_INFO_TIME
> H5O_INFO_NUM_ATTRS
> H5O_INFO_HDR
> H5O_INFO_META_SIZE
> H5O_INFO_ALL (== H5O_INFO_TIME | H5O_INFO_NUM_ATTRS | H5O_INFO_HDR |
>     H5O_INFO_META_SIZE)
> 
> Passing these flags tells the library to fill in the corresponding
> fields in oinfo.  Other fields are always filled in because there is no
> performance penalty.  In your case, since you only need the type, you
> can just pass "0".  h5ls has also been modified to use these, so it
> should be faster.
> 
> Of course, this is experimental code and should not be used in
> production, but if you're curious how much a lightweight H5Oget_info
> would help your performance you're welcome to try it.  If you do, we'd
> love to hear about your results, and also your thoughts on the
> interface.  For maximum performance, you should configure the library
> with "--enable-production" (for this branch, not necessary for releases).
> 
> Thanks,
> -Neil
> 

Hi Neil,

I managed to build this branch and test it, and it has indeed improved
performance dramatically. As you suggested, I pass zero for the "fields"
argument; I did not test any other values. With that value, and checking
only the "type" field in H5O_info_t, it runs much faster than the
previous version. h5ls also works better on our files.
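
To make the pattern concrete, here is a small stand-alone mock of a
"fields" bitmask, not the HDF5 code itself. Every demo_* / DEMO_* name
and every value in it is hypothetical; the real flags and H5O_info_t
come from the branch headers. Going by the signature quoted above, the
real call would presumably look something like
H5Oget_info_by_name2(loc_id, name, &oinfo, 0, H5P_DEFAULT).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins: the real flag values are defined by the HDF5
 * branch headers and may differ from these. */
#define DEMO_INFO_TIME      0x01u
#define DEMO_INFO_NUM_ATTRS 0x02u
#define DEMO_INFO_HDR       0x04u
#define DEMO_INFO_META_SIZE 0x08u
#define DEMO_INFO_ALL (DEMO_INFO_TIME | DEMO_INFO_NUM_ATTRS | \
                       DEMO_INFO_HDR | DEMO_INFO_META_SIZE)

enum { DEMO_TYPE_DATASET = 1 };

/* Mock of an object-info record; not the layout of H5O_info_t. */
typedef struct {
    int           type;       /* cheap, always filled in */
    long          mtime;      /* only with DEMO_INFO_TIME */
    unsigned      num_attrs;  /* only with DEMO_INFO_NUM_ATTRS */
    unsigned long hdr_size;   /* only with DEMO_INFO_HDR */
    unsigned long meta_size;  /* only with DEMO_INFO_META_SIZE (expensive) */
} demo_info_t;

/* Counts simulated chunk-index walks so the cost is observable. */
unsigned demo_btree_walks = 0;

/* Stands in for walking a large chunk B-tree (many small reads). */
unsigned long demo_walk_chunk_index(void)
{
    demo_btree_walks++;
    return 4096ul;  /* made-up metadata size in bytes */
}

/* Fills only the fields requested in `fields`; returns 0 on success. */
int demo_get_info(demo_info_t *out, unsigned fields)
{
    assert(out != NULL);
    *out = (demo_info_t){0};        /* unrequested fields stay zero */
    out->type = DEMO_TYPE_DATASET;  /* needs no flag */

    if (fields & DEMO_INFO_TIME)
        out->mtime = 1300000000L;   /* made-up timestamp */
    if (fields & DEMO_INFO_NUM_ATTRS)
        out->num_attrs = 3u;        /* made-up attribute count */
    if (fields & DEMO_INFO_HDR)
        out->hdr_size = 272ul;      /* made-up header size */
    if (fields & DEMO_INFO_META_SIZE)
        out->meta_size = demo_walk_chunk_index();

    return 0;
}
```

Calling demo_get_info(&info, 0) sets "type" and never triggers the
simulated walk, which mirrors passing 0 as Neil described. Flags can be
combined with "|" to request several of the optional fields in one call.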

What I find interesting is that there is no version of H5Oget_info_by_idx
that takes a "fields" argument. Is this function so different from
H5Oget_info and H5Oget_info_by_name that it cannot be optimized the same way?

Even without H5Oget_info_by_idx2, I'd be happy to see this branch
included in the next release.

Cheers,
Andy

