Hi Matt,
On Nov 22, 2011, at 1:11 PM, Matthew Turk wrote:
> Hi all,
>
> I'm a long time user of HDF5 (mostly via the Enzo project), but new to
> optimizing and attempting to really take advantage of the features of
> the library. We're seeing substantial performance problems at the
> present, and we're attempting to narrow it down. As a bit of
> background, the data in our files is structured such that we have
> top-level groups (of the format /Grid00000001 , /Grid00000002 , etc)
> and then off of each group hang a fixed number of datasets, and the
> files themselves are write-once read-many. We're in the position that
> we know in advance exactly how many groups we have, how many datasets
> hang off of each group (or at least a reasonable upper bound) and all
> of our data is streamed out exactly as-is, with no chunking.
>
> What we've found lately is that about 30% of the time to read in a
> dataset is spent just opening the individual grids. The remainder
> is the actual calls to read the data. My naive guess at the source of
> this behavior is that the opening of the groups involves reading a
> potentially distributed index. Because of our particular situation --
> a fixed number of groups and datasets and inviolate data-on-disk -- is
> there a particular mechanism or parameter we could set by which we
> could speed up access to the groups and datasets?
There's no distributed index, really; each group just has a local heap with
the link info in it and a B-tree that indexes it. How large are the files
you are accessing? Are you using serial or parallel access to them? What
system/file system are you using?
Quincey
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org