Andy,
On 03/20/2011 02:28 AM, Salnikov, Andrei A. wrote:
Neil Fortner wrote on 2011-03-14:
Andy,
On 03/11/2011 06:48 PM, Salnikov, Andrei A. wrote:
Quincey Koziol wrote on 2011-03-10:
Hi Andy,
On Mar 9, 2011, at 11:15 AM, Salnikov, Andrei A. wrote:
Quincey Koziol wrote on 2011-03-09:
Hi Andy,
On Mar 8, 2011, at 7:09 PM, Salnikov, Andrei A. wrote:
Hi,
I'm trying to understand a performance hit that we are
experiencing trying to examine the tree structure of
our HDF5 files. Originally we observed the problem when
using h5py, but it can be reproduced even with the h5ls
command. I tracked it down to a significant delay in
the call to the H5Oget_info_by_name function on a dataset
with a large number of chunks. It looks like when the
number of chunks in a dataset increases (in our case
we have 1-10k chunks), the performance of H5Oget_info
drops significantly. Looking at the I/O statistics, it
seems that the HDF5 library does a great many small I/O
operations in this case. Very little CPU time is spent, but
real time is measured in tens of seconds.
Is this expected behavior? Can it be improved somehow
without reducing the number of chunks drastically?
One more comment about H5Oget_info - it returns a
structure that contains a lot of different info.
In the case of the h5py code, the only member of the
structure that is used is "type". Could there be a
more efficient way to determine just the type of the
object without requiring every other piece of info?
Ah, yes, we've noticed that in some of the applications we've
worked with also (including some of the main HDF5 tools, like h5ls,
etc). As you say, H5Oget_info() is fairly heavyweight, getting all
sorts of information about each object. I do think a lighter-weight
call like "H5Oget_type" would be useful. Is there other
"lightweight" information that people would like back for each
object?
Quincey
Hi Quincey,
thanks for confirming this. Could you explain briefly what is
going on there and which part of H5O_info_t needs so many reads?
The H5Oget_info() call is gathering information about the amount of
space that the metadata for the dataset is using. When there's a large
B-tree for indexing the chunks, walking that B-tree can take a fair bit
of time.
Maybe removing heavyweight info from H5O_info_t is the right
thing to do, or creating another version of H5O_info_t structure
which has only light-weight info?
I'm leaning toward another light-weight version. I'm asking the HDF5
community to help me decide what goes into that structure besides the
object type.
Hi Quincey,
is there a chance we can get this new version in the next release?
We actually already have an experimental branch with a similar feature
mostly implemented. It allows you to specify the fields you want filled
in by H5Oget_info. The branch can be found at:
http://svn.hdfgroup.uiuc.edu/hdf5/branches/h5oget_info_by_field/
The new functions are:
herr_t H5Oget_info2(hid_t loc_id, H5O_info_t *oinfo, unsigned fields);
herr_t H5Oget_info_by_name2(hid_t loc_id, const char *name,
                            H5O_info_t *oinfo, unsigned fields, hid_t lapl_id);
The "fields" parameter can contain the following bitflags (combined with
"|"):
    H5O_INFO_TIME
    H5O_INFO_NUM_ATTRS
    H5O_INFO_HDR
    H5O_INFO_META_SIZE
    H5O_INFO_ALL (== H5O_INFO_TIME | H5O_INFO_NUM_ATTRS | H5O_INFO_HDR |
                     H5O_INFO_META_SIZE)
Passing these flags tells the library to fill in the corresponding
fields in oinfo. Other fields are always filled in because there is no
performance penalty. In your case, since you only need the type, you
can just pass "0". h5ls has also been modified to use these, so it
should be faster.
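
The field-selection idea described above can be sketched in plain C without linking against HDF5. This is only an illustration of the pattern, not the branch's implementation: every `demo_` / `DEMO_` name, the flag values, and the simulated chunk index are assumptions made up for the example, and the real H5O_INFO_* values and H5O_info_t layout are not reproduced here. The point it tries to show is that the object type is always filled in, while the expensive index walk only happens when the caller asks for the metadata size.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in flag values: illustrative only, NOT the values used by HDF5. */
#define DEMO_INFO_TIME       0x01u
#define DEMO_INFO_NUM_ATTRS  0x02u
#define DEMO_INFO_HDR        0x04u
#define DEMO_INFO_META_SIZE  0x08u
#define DEMO_INFO_ALL (DEMO_INFO_TIME | DEMO_INFO_NUM_ATTRS | \
                       DEMO_INFO_HDR | DEMO_INFO_META_SIZE)

typedef enum { DEMO_TYPE_GROUP, DEMO_TYPE_DATASET } demo_type_t;

/* Hypothetical object: a dataset whose chunk index has n_index_nodes nodes. */
typedef struct {
    demo_type_t type;
    unsigned    num_attrs;
    size_t      n_index_nodes;
    size_t      node_size;
} demo_object_t;

/* Hypothetical result structure, loosely modelled on the thread's description. */
typedef struct {
    demo_type_t type;       /* always filled in */
    unsigned    num_attrs;  /* filled in only with DEMO_INFO_NUM_ATTRS */
    size_t      meta_size;  /* filled in only with DEMO_INFO_META_SIZE */
} demo_info_t;

/* Counts simulated index-node reads, standing in for small I/O operations. */
static unsigned long demo_node_reads = 0;

/* Simulated index walk: one "read" per node, cost grows with chunk count. */
static size_t demo_walk_index(const demo_object_t *obj)
{
    size_t total = 0;
    for (size_t i = 0; i < obj->n_index_nodes; i++) {
        demo_node_reads++;
        total += obj->node_size;
    }
    return total;
}

/* Fill in only the fields selected by the bitmask; returns 0 on success. */
int demo_get_info(const demo_object_t *obj, demo_info_t *info, unsigned fields)
{
    assert(obj != NULL && info != NULL);

    info->type      = obj->type;  /* cheap, so always provided */
    info->num_attrs = 0;
    info->meta_size = 0;

    if (fields & DEMO_INFO_NUM_ATTRS)
        info->num_attrs = obj->num_attrs;
    if (fields & DEMO_INFO_META_SIZE)
        info->meta_size = demo_walk_index(obj);  /* the expensive part */

    return 0;
}
```

In this sketch, calling `demo_get_info(&obj, &info, 0)` on an object with 10000 index nodes still returns the type but leaves `demo_node_reads` at zero, whereas passing `DEMO_INFO_ALL` triggers one simulated read per node.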
Of course, this is experimental code and should not be used in
production, but if you're curious how much a lightweight H5Oget_info
would help your performance you're welcome to try it. If you do, we'd
love to hear about your results, and also your thoughts on the
interface. For maximum performance, you should configure the library
with "--enable-production" (for this branch, not necessary for releases).
Thanks,
-Neil
Hi Neil,
I managed to build this branch and test it. It has indeed improved
performance dramatically. As you suggested, I only used a zero value for
the fields argument; other values were not included in my test.
With that value, and checking only the "type" field in H5O_info_t, it
runs much faster than the previous version. 'h5ls' also works better on
our files.
What I find interesting is that there is no version of H5Oget_info_by_idx
that takes a "fields" argument. Is this function so different
from H5Oget_info and H5Oget_info_by_name that it cannot be optimized?
Even without H5Oget_info_by_idx2 I'd be happy to see this branch
included in the next release.
Glad to hear it improved your performance! It would be easy to add
H5Oget_info_by_idx2, we just didn't do that because we only did the
minimum needed to test the performance in the case we were looking at,
and stopped after reaching that point. We shelved the work because it
didn't make a huge difference in the case we were looking at, but with
your report I will look into getting it scheduled sooner rather than
later. There is a chance we may change the interface to something like
what Quincey suggested. Thanks for taking the time to test this!
-Neil
Cheers,
Andy
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org