Hi Quincey,

On Wed, Nov 23, 2011 at 11:22 AM, Quincey Koziol <[email protected]> wrote:
> Hi Matt,
>
> On Nov 22, 2011, at 1:11 PM, Matthew Turk wrote:
>
>> Hi all,
>>
>> I'm a long-time user of HDF5 (mostly via the Enzo project), but new to
>> optimizing and attempting to really take advantage of the features of
>> the library.  We're seeing substantial performance problems at
>> present, and we're trying to narrow them down.  As a bit of
>> background, the data in our files is structured such that we have
>> top-level groups (of the form /Grid00000001, /Grid00000002, etc.)
>> and then off of each group hang a fixed number of datasets, and the
>> files themselves are write-once read-many.  We're in the position that
>> we know in advance exactly how many groups we have and how many datasets
>> hang off of each group (or at least a reasonable upper bound), and all
>> of our data is written out contiguously, with no chunking.
>>
>> What we've found lately is that about 30% of the time to read a
>> dataset is spent just opening the individual grids.  The remainder
>> goes to the actual calls that read the data.  My naive guess at the
>> source of this behavior is that opening the groups involves reading a
>> potentially distributed index.  Given our particular situation --
>> a fixed number of groups and datasets, and inviolate data-on-disk --
>> is there a particular mechanism or parameter we could set to speed
>> up access to the groups and datasets?
>
>        There's no distributed index, really; each group just has a heap with
> the link info in it and a B-tree that indexes the links.  How large are the
> files you are accessing?  Are you using serial or parallel access to them?
> What system/file system are you using?

Thanks for your reply!  Here's some more detailed info about the
files.  The tests have been conducted on a few file systems --
local disk (ext4), NFS, and Lustre.  The files themselves are about
150-250 MB, though we often see much larger files.  Each file has on
the order of 200-300 groups (grids), each of which holds ~40
datasets (all of which share roughly the same set of names).  The
datasets themselves are fairly small -- we have both 3D and 1D
datasets, and the 3D datasets contain ~10,000 elements on average.  All
access (read and write) is done in serial to a single file.
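
In case it's relevant, this is roughly how I can check what storage
format the groups in our files are using (a minimal sketch; the file
name is a placeholder, and the group name follows the /GridNNNNNNNN
pattern above):

#include <stdio.h>
#include "hdf5.h"

int main(void)
{
    /* Placeholder file name; the group name follows our real layout. */
    hid_t file = H5Fopen("data0001.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t grp  = H5Gopen2(file, "/Grid00000001", H5P_DEFAULT);

    H5G_info_t info;
    H5Gget_info(grp, &info);

    /* storage_type reports whether the links live in the classic
       symbol table (local heap + B-tree), compactly in the object
       header, or in the 1.8 dense (fractal heap) format. */
    printf("nlinks = %llu, storage_type = %d\n",
           (unsigned long long) info.nlinks, (int) info.storage_type);

    H5Gclose(grp);
    H5Fclose(file);
    return 0;
}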

I suppose my question was ill-posed; what I was wondering is whether
there is any way to speed up group opening.  The figure I quoted
earlier, about 30% of the time going to grid opening, is a middling
case.  In some cases (on local disk, running over the same file 100
times and averaging), opening and closing the groups accounts for
50-60% of the total time spent opening groups and reading data.
A broader question: should I be surprised by this?  Or is it to be
expected, given how HDF5 operates (and all of the utility it
provides)?  Is there low-hanging fruit I should be addressing in our
data handling?
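
For concreteness, the measurement loop looks roughly like the sketch
below (simplified: the dataset name is a placeholder, we actually read
~40 datasets per group, and the fixed buffer assumes every dataset
fits in ~10^4 doubles):

#include <stdio.h>
#include <time.h>
#include "hdf5.h"

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(void)
{
    hid_t file = H5Fopen("data0001.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    double t_grp = 0.0, t_data = 0.0;
    static double buf[10000];   /* assumes datasets fit; a sketch, not production code */
    char name[32];
    int g;

    for (g = 1; g <= 300; g++) {
        double t0;

        snprintf(name, sizeof name, "/Grid%08d", g);

        t0 = now();
        hid_t grp = H5Gopen2(file, name, H5P_DEFAULT);
        t_grp += now() - t0;

        t0 = now();
        hid_t dset = H5Dopen2(grp, "Density", H5P_DEFAULT);  /* placeholder name */
        H5Dread(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);
        H5Dclose(dset);
        t_data += now() - t0;

        t0 = now();
        H5Gclose(grp);
        t_grp += now() - t0;
    }

    printf("group open/close: %.3f s, dataset open+read: %.3f s\n", t_grp, t_data);
    H5Fclose(file);
    return 0;
}

Since we know the group and link counts up front when writing, would
something like H5Pset_est_link_info (or the compact/dense thresholds
via H5Pset_link_phase_change) on the group creation property list be
the kind of parameter worth experimenting with?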

Thanks so much for any ideas you might have,

Matt

>
>        Quincey
>

_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org
