Hi Quincey,

Sorry for the delay over the weekend. I have gone ahead and posted a sample dataset, and I have used this code:
http://paste.yt-project.org/show/1961/ to open it. Just now I ran this and received:

Read: 0 with time 53.893453 over 1000 runs
Read: 1 with time 197.589276 over 1000 runs

Thanks for any ideas you might have. We control both the creation and consumption of these files, for the most part, so we're eager for solutions that will help on either end. (And if there's anything particularly bad or naive in my code, pointing that out would be appreciated, too!)

As a side note, we also usually use either C++ or h5py to read files for analysis, which is where we are hit particularly hard.

Best,
Matt

On Wed, Nov 23, 2011 at 1:24 PM, Quincey Koziol <[email protected]> wrote:
> Hi Matt,
>
> On Nov 23, 2011, at 11:08 AM, Matthew Turk wrote:
>
>> Hi Quincey,
>>
>> On Wed, Nov 23, 2011 at 11:22 AM, Quincey Koziol <[email protected]> wrote:
>>> Hi Matt,
>>>
>>> On Nov 22, 2011, at 1:11 PM, Matthew Turk wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'm a long-time user of HDF5 (mostly via the Enzo project), but new to
>>>> optimizing and attempting to really take advantage of the features of
>>>> the library. We're seeing substantial performance problems at present,
>>>> and we're attempting to narrow them down. As a bit of background, the
>>>> data in our files is structured such that we have top-level groups (of
>>>> the format /Grid00000001, /Grid00000002, etc.), and off of each group
>>>> hang a fixed number of datasets; the files themselves are write-once,
>>>> read-many. We're in the position that we know in advance exactly how
>>>> many groups we have, how many datasets hang off of each group (or at
>>>> least a reasonable upper bound), and all of our data is streamed out
>>>> exactly, with no chunking.
>>>>
>>>> What we've found lately is that about 30% of the time to read a
>>>> dataset in occurs just in opening the individual grids. The remainder
>>>> is the actual calls to read the data.
>>>> My naive guess at the source of this behavior is that opening the
>>>> groups involves reading a potentially distributed index. Because of
>>>> our particular situation -- a fixed number of groups and datasets and
>>>> inviolate data-on-disk -- is there a particular mechanism or parameter
>>>> we could set by which we could speed up access to the groups and
>>>> datasets?
>>>
>>> There's no distributed index, really; each group just has a heap with
>>> the link info in it and a B-tree that indexes them. How large are the
>>> files you are accessing? Are you using serial or parallel access to
>>> them? What system/file system are you using?
>>
>> Thanks for your reply! Here's some more detailed info about the files.
>> The tests have been conducted on a couple of file systems -- local disk
>> (ext4), NFS, and Lustre. The files themselves are about 150-250 MB, but
>> we can often see much larger files. The files have on the order of
>> 200-300 groups (grids) per file, each of which has ~40 datasets (all of
>> which share roughly the same set of names). The datasets themselves are
>> somewhat small -- we have both 3D and 1D datasets, and the 3D datasets
>> on average contain ~10000 elements. All access (read and write) is done
>> in serial to a single file.
>>
>> I suppose my question was ill-posed; what I was wondering about is
>> whether there might be any way to speed up group opening. The numbers I
>> quoted earlier, of about 30% for grid opening, are a medium case. In
>> some cases (on local disk, running over the same file 100 times and
>> averaging) we actually see that opening and closing the groups takes
>> about 50-60% of the time that opening the groups and reading the data
>> takes. A broader question is: should I be surprised at this? Or is
>> this to be expected, based on how HDF5 operates (and all of the utility
>> it provides)? Is there low-hanging fruit that I should be addressing
>> with the data handling?
>
> I would think that this should be faster... Do you have a test
> program and file(s) that I could profile with?
>
> Quincey
>
> _______________________________________________
> Hdf-forum is for HDF software users discussion.
> [email protected]
> http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org
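[Editor's note: the "Read: 0 / Read: 1" numbers quoted above come from the pasted script at http://paste.yt-project.org/show/1961/. A minimal sketch of that kind of timing harness is below; the two stand-in functions are placeholders for the real h5py/HDF5 calls (phase 0: opening the groups only; phase 1: opening the groups and reading the datasets), which are an assumption here, not the actual pasted code.]

```python
import time

def time_over_runs(func, runs=1000):
    """Time func() repeated `runs` times; return total elapsed seconds."""
    start = time.perf_counter()
    for _ in range(runs):
        func()
    return time.perf_counter() - start

# Stand-ins for the two phases being compared. In the real script these
# would open the /Grid* groups (phase 0) and open the groups plus read
# their datasets (phase 1).
def open_groups_only():
    pass

def open_groups_and_read():
    pass

for phase, func in enumerate([open_groups_only, open_groups_and_read]):
    elapsed = time_over_runs(func, runs=1000)
    print("Read: %d with time %f over %d runs" % (phase, elapsed, 1000))
```

Comparing the two totals is what yields the "opening groups is 30-60% of the whole read" figures discussed in the thread.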
