Hi Matt,

On Nov 23, 2011, at 11:08 AM, Matthew Turk wrote:
> Hi Quincey,
>
> On Wed, Nov 23, 2011 at 11:22 AM, Quincey Koziol <[email protected]> wrote:
>> Hi Matt,
>>
>> On Nov 22, 2011, at 1:11 PM, Matthew Turk wrote:
>>
>>> Hi all,
>>>
>>> I'm a long-time user of HDF5 (mostly via the Enzo project), but new to
>>> optimizing and attempting to really take advantage of the features of
>>> the library. We're seeing substantial performance problems at present,
>>> and we're attempting to narrow them down. As a bit of background, the
>>> data in our files is structured such that we have top-level groups (of
>>> the format /Grid00000001, /Grid00000002, etc.), and off of each group
>>> hang a fixed number of datasets; the files themselves are write-once
>>> read-many. We're in the position that we know in advance exactly how
>>> many groups we have, how many datasets hang off of each group (or at
>>> least a reasonable upper bound), and all of our data is streamed out
>>> exactly, with no chunking.
>>>
>>> What we've found lately is that about 30% of the time to read a
>>> dataset occurs just in opening the individual grids. The remainder is
>>> the actual calls to read the data. My naive guess at the source of
>>> this behavior is that opening the groups involves reading a
>>> potentially distributed index. Because of our particular situation --
>>> a fixed number of groups and datasets and inviolate data-on-disk -- is
>>> there a particular mechanism or parameter we could set by which we
>>> could speed up access to the groups and datasets?
>>
>> There's no distributed index, really; each group just has a heap with
>> the link info in it and a B-tree that indexes them. How large are the
>> files you are accessing? Are you using serial or parallel access to
>> them? What system/file system are you using?
>
> Thanks for your reply! Here's some more detailed info about the files.
> The tests have been conducted on a couple of file systems -- local disk
> (ext4), NFS, and Lustre. The files themselves are about ~150-250 MB,
> but we can often see much larger files. The files have on the order of
> 200-300 groups (grids) per file, each of which has ~40 datasets (all of
> which share roughly the same set of names). The datasets themselves are
> somewhat small -- we have both 3D and 1D datasets, and the 3D datasets
> on average contain ~10000 elements. All access (read and write) is done
> in serial to a single file.
>
> I suppose my question was ill-posed; what I was wondering is whether
> there might be any way to speed up group opening. The numbers I quoted
> earlier, of about 30% for grid opening, are a medium case. In some
> cases (on local disk, running over the same file 100 times and
> averaging) we actually see that opening and closing the groups takes
> about 50-60% of the time that opening the groups and reading the data
> takes. I suppose a broader question is, should I be surprised at this?
> Or is this to be expected, based on how HDF5 operates (and all of the
> utility it provides)? Are there low-hanging fruit that I should be
> addressing with the data handling?

I would think that this should be faster... Do you have a test program
and file(s) that I could profile with?

	Quincey
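
P.S. In case it helps while you put files together, here is roughly the
shape of standalone timing program I had in mind -- a minimal sketch
only. The grid count, the /Grid%08d naming, and the "Density" dataset
name are assumptions based on your description, so adjust them to match
your files (build with h5cc):

/* grid_open_timing.c -- time H5Gopen2/H5Gclose separately from the
 * dataset open + H5Dread, over every grid group in one file. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include "hdf5.h"

#define NUM_GRIDS 200            /* assumed; set to the real grid count */
#define DSET_NAME "Density"      /* assumed; any per-grid dataset name  */

static double now(void)          /* wall-clock seconds */
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s file.h5\n", argv[0]); return 1; }

    hid_t file = H5Fopen(argv[1], H5F_ACC_RDONLY, H5P_DEFAULT);
    if (file < 0) { fprintf(stderr, "cannot open %s\n", argv[1]); return 1; }

    double t_group = 0.0, t_read = 0.0;

    for (int i = 1; i <= NUM_GRIDS; i++) {
        char gname[32];
        snprintf(gname, sizeof(gname), "/Grid%08d", i);

        /* Time just the group open... */
        double t0 = now();
        hid_t grp = H5Gopen2(file, gname, H5P_DEFAULT);
        t_group += now() - t0;
        if (grp < 0) continue;   /* fewer grids than NUM_GRIDS */

        /* ...and the dataset open + read separately. */
        t0 = now();
        hid_t dset = H5Dopen2(grp, DSET_NAME, H5P_DEFAULT);
        if (dset >= 0) {
            hid_t space = H5Dget_space(dset);
            hssize_t n = H5Sget_simple_extent_npoints(space);
            double *buf = malloc((size_t)n * sizeof(double));
            if (buf != NULL)
                H5Dread(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL,
                        H5P_DEFAULT, buf);
            free(buf);
            H5Sclose(space);
            H5Dclose(dset);
        }
        t_read += now() - t0;

        /* Count the group close with the group side of the ledger. */
        t0 = now();
        H5Gclose(grp);
        t_group += now() - t0;
    }

    H5Fclose(file);
    printf("group open/close: %.3f s, dataset open+read: %.3f s\n",
           t_group, t_read);
    return 0;
}

Separating the group open/close timing from the dataset open/read like
this should make it clear whether the group traversal really is where
your 30-60% is going.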
