Neil Fortner wrote on 2011-03-14:
> Andy,
>
> On 03/11/2011 06:48 PM, Salnikov, Andrei A. wrote:
>> Quincey Koziol wrote on 2011-03-10:
>>> Hi Andy,
>>>
>>> On Mar 9, 2011, at 11:15 AM, Salnikov, Andrei A. wrote:
>>>
>>>> Quincey Koziol wrote on 2011-03-09:
>>>>> Hi Andy,
>>>>>
>>>>> On Mar 8, 2011, at 7:09 PM, Salnikov, Andrei A. wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm trying to understand a performance hit that we are
>>>>>> experiencing trying to examine the tree structure of
>>>>>> our HDF5 files. Originally we observed problem when
>>>>>> using h5py but it could be reproduced even with h5ls
>>>>>> command. I tracked it down to a significant delay in
>>>>>> the call to H5Oget_info_by_name function on a dataset
>>>>>> with a large number of chunks. It looks like when the
>>>>>> number of chunks in dataset increases (in our case
>>>>>> we have 1-10k chunks) the performance of the H5Oget_info
>>>>>> drops significantly. Looking at the IO statistics it
>>>>>> seems that HDF5 library does very many small IO operations
>>>>>> in this case. There is very little CPU spent, but real
>>>>>> time is measured in tens of seconds.
>>>>>>
>>>>>> Is this an expected behavior? Can it be improved somehow
>>>>>> without reducing the number of chunks drastically?
>>>>>>
>>>>>> One more comment about H5Oget_info - it returns a
>>>>>> structure that contains a lot of different info.
>>>>>> In the case of h5py code the only member of the
>>>>>> structure used in the code is "type". could there be
>>>>>> more efficient way to determine just the type of the
>>>>>> object without requiring every other piece of info?
>>>>> Ah, yes, we've noticed that in some of the applications we've
>>>>> worked with also (including some of the main HDF5 tools, like h5ls,
>>>>> etc). As you say, H5Oget_info() is fairly heavyweight, getting all
>>>>> sorts of information about each object. I do think a lighter-weight
>>>>> call like "H5Oget_type" would be useful. Is there other
>>>>> "lightweight" information that people would like back for each
>>>>> object?
>>>>>
>>>>> Quincey
>>>>>
>>>> Hi Quincey,
>>>>
>>>> thanks for confirming this. Could you explain briefly what is
>>>> going on there and which part of H5O_info_t needs so many reads?
>>> The H5Oget_info() call is gathering information about the amount of
>>> space that the metadata for the dataset is using. When there's a large
>>> B-tree for indexing the chunks, that can take a fair bit of time to
>>> walk the B-tree.
>>>
>>>> Maybe removing heavyweight info from H5O_info_t is the right
>>>> thing to do, or creating another version of H5O_info_t structure
>>>> which has only light-weight info?
>>> I'm leaning toward another light-weight version. I'm asking the HDF5
>>> community to help me decide what goes into that structure besides the
>>> object type.
>>>
>> Hi Quincey,
>>
>> is there a chance we can get this new version in the next release?
>
> We actually already have an experimental branch with a similar feature
> mostly implemented. It allows you to specify the fields you want filled
> in by H5Oget_info. The branch can be found at:
>
> http://svn.hdfgroup.uiuc.edu/hdf5/branches/h5oget_info_by_field/
>
> The new functions are:
>
> herr_t H5Oget_info2(hid_t loc_id, H5O_info_t *oinfo, unsigned fields);
> herr_t H5Oget_info_by_name2(hid_t loc_id, const char *name, H5O_info_t
> *oinfo, unsigned fields, hid_t lapl_id);
>
> The "fields" parameter can contain the following bitflags (combined with
> "|"):
>
> H5O_INFO_TIME H5O_INFO_NUM_ATTRS H5O_INFO_HDR H5O_INFO_META_SIZE
> H5O_INFO_ALL (==H5O_INFO_TIME | H5O_INFO_NUM_ATTRS | H5O_INFO_HDR |
> H5O_INFO_META_SIZE)
>
> Passing these flags tells the library to fill in the corresponding
> fields in oinfo. Other fields are always filled in because there is no
> performance penalty. In your case, since you only need the type, you
> can just pass "0". h5ls has also been modified to use these, so it
> should be faster.
>
> Of course, this is experimental code and should not be used in
> production, but if you're curious how much a lightweight H5Oget_info
> would help your performance you're welcome to try it. If you do, we'd
> love to hear about your results, and also your thoughts on the
> interface. For maximum performance, you should configure the library
> with "--enable-production" (for this branch, not necessary for releases).
>
> Thanks,
> -Neil
>
Hi Neil,

I managed to build this branch and test it, and it has indeed improved
performance dramatically. As you suggested, I pass only zero for the
"fields" argument; I have not tested any other values. With zero, and
reading only the "type" field of H5O_info_t, it runs much faster than
the previous version. h5ls also works better on our files.

One thing I noticed is that there is no version of H5Oget_info_by_idx
that takes a "fields" argument. Is this function so different from
H5Oget_info and H5Oget_info_by_name that it cannot be optimized in the
same way?

Even without H5Oget_info_by_idx2 I'd be happy to see this branch
included in the next release.

Cheers,
Andy
_______________________________________________
Hdf-forum is for HDF software users discussion.
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org
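
For readers following the thread, here is a minimal, self-contained sketch of the "fields" bitmask idea discussed above. It does not use the HDF5 library: every name in it (INFO_*, toy_object_t, obj_info_t, get_info, walk_index, nodes_visited) is a hypothetical stand-in invented for illustration, and the chunk B-tree is modelled as a plain loop that counts simulated node reads. It only shows why passing 0 avoids the expensive walk while the cheap "type" field is still filled in.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flags, loosely modelled on the ones quoted in the thread.
 * These are NOT the HDF5 definitions. */
#define INFO_TIME      0x01u  /* unused in this toy model */
#define INFO_NUM_ATTRS 0x02u
#define INFO_HDR       0x04u  /* unused in this toy model */
#define INFO_META_SIZE 0x08u
#define INFO_ALL (INFO_TIME | INFO_NUM_ATTRS | INFO_HDR | INFO_META_SIZE)

typedef enum { OBJ_GROUP, OBJ_DATASET } obj_type_t;

/* Hypothetical stand-in for a stored object with a chunk index. */
typedef struct {
    obj_type_t type;
    unsigned   num_attrs;
    size_t     num_index_nodes;  /* nodes in the simulated chunk index */
    size_t     node_size;        /* bytes per simulated index node */
} toy_object_t;

/* Hypothetical stand-in for the info structure returned to the caller. */
typedef struct {
    obj_type_t type;       /* always filled in: cheap */
    unsigned   num_attrs;  /* filled in only with INFO_NUM_ATTRS */
    size_t     meta_size;  /* filled in only with INFO_META_SIZE */
} obj_info_t;

/* Counts simulated index-node reads, standing in for small I/O operations. */
size_t nodes_visited = 0;

/* Simulated index walk: one "read" per node, summing the node sizes. */
size_t walk_index(const toy_object_t *obj)
{
    size_t total = 0;
    for (size_t i = 0; i < obj->num_index_nodes; i++) {
        nodes_visited++;
        total += obj->node_size;
    }
    return total;
}

/* Fill in only the requested fields; the type is always provided. */
int get_info(const toy_object_t *obj, obj_info_t *out, unsigned fields)
{
    assert(obj != NULL && out != NULL);
    memset(out, 0, sizeof *out);

    out->type = obj->type;
    if (fields & INFO_NUM_ATTRS)
        out->num_attrs = obj->num_attrs;
    if (fields & INFO_META_SIZE)
        out->meta_size = walk_index(obj);  /* the expensive part */
    return 0;
}
```

In this toy model, calling get_info(&obj, &info, 0) leaves nodes_visited unchanged, while adding INFO_META_SIZE to the mask triggers one simulated read per index node. That matches the behaviour described in the thread only in outline; the real cost depends on the actual library and file layout.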
