Hi Quincey,

On Wed, Nov 23, 2011 at 11:22 AM, Quincey Koziol <[email protected]> wrote:
> Hi Matt,
>
> On Nov 22, 2011, at 1:11 PM, Matthew Turk wrote:
>
>> Hi all,
>>
>> I'm a long-time user of HDF5 (mostly via the Enzo project), but new to
>> optimizing and really taking advantage of the features of the library.
>> We're seeing substantial performance problems at present, and we're
>> attempting to narrow them down. As a bit of background, the data in our
>> files is structured such that we have top-level groups (of the form
>> /Grid00000001, /Grid00000002, etc.), a fixed number of datasets hang
>> off of each group, and the files themselves are write-once, read-many.
>> We're in the position of knowing in advance exactly how many groups we
>> have and how many datasets hang off of each group (or at least a
>> reasonable upper bound), and all of our data is streamed out
>> contiguously, with no chunking.
>>
>> What we've found lately is that about 30% of the time to read a dataset
>> in is spent just opening the individual grids; the remainder is the
>> actual calls to read the data. My naive guess at the source of this
>> behavior is that opening the groups involves reading a potentially
>> distributed index. Given our particular situation -- a fixed number of
>> groups and datasets, and inviolate data on disk -- is there a particular
>> mechanism or parameter we could set to speed up access to the groups
>> and datasets?
>
> There's no distributed index, really; each group just has a heap with
> the link info in it and a B-tree that indexes the links. How large are
> the files you are accessing? Are you using serial or parallel access to
> them? What system/file system are you using?
Thanks for your reply! Here's some more detailed info about the files.
The tests have been conducted on a couple of file systems -- local disk
(ext4), NFS, and Lustre. The files themselves are about ~150-250 MB, but
we often see much larger ones. Each file has on the order of 200-300
groups (grids), each of which has ~40 datasets (all of which share
roughly the same set of names). The datasets themselves are somewhat
small -- we have both 3D and 1D datasets, and the 3D datasets contain
~10,000 elements on average. All access (read and write) is done in
serial, to a single file.

I suppose my question was ill-posed; what I was wondering is whether
there is any way to speed up group opening. The number I quoted earlier
-- about 30% of the time going to opening grids -- is a medium case. In
some cases (on local disk, running over the same file 100 times and
averaging), we actually see that opening and closing the groups takes
about 50-60% of the combined time to open the groups and read the data.

A broader question: should I be surprised at this? Or is this to be
expected, given how HDF5 operates (and all of the utility it provides)?
Is there low-hanging fruit I should be addressing in our data handling?

Thanks so much for any ideas you might have,

Matt

> Quincey
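P.S. In case it helps to make the access pattern concrete, here's
roughly what we're timing. This is a minimal sketch: the file name, the
"Density" dataset name, the grid count, and the double datatype are all
made up for illustration, we read just one dataset per grid for brevity,
and error checking is omitted.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "hdf5.h"

int main(void)
{
    hid_t file = H5Fopen("data0042.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    double t_group = 0.0, t_read = 0.0;
    char name[32];
    int i;

    for (i = 1; i <= 300; i++) {   /* ~200-300 grids per file */
        /* Time the group open. */
        clock_t t0 = clock();
        snprintf(name, sizeof(name), "/Grid%08d", i);
        hid_t grid = H5Gopen2(file, name, H5P_DEFAULT);
        t_group += (double)(clock() - t0) / CLOCKS_PER_SEC;

        /* Time the dataset open + read. */
        t0 = clock();
        hid_t dset = H5Dopen2(grid, "Density", H5P_DEFAULT);
        hid_t space = H5Dget_space(dset);
        hssize_t n = H5Sget_simple_extent_npoints(space);  /* ~10,000 */
        double *buf = malloc(n * sizeof(double));
        H5Dread(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);
        free(buf);
        H5Sclose(space);
        H5Dclose(dset);
        t_read += (double)(clock() - t0) / CLOCKS_PER_SEC;

        /* Time the group close as well, since that shows up too. */
        t0 = clock();
        H5Gclose(grid);
        t_group += (double)(clock() - t0) / CLOCKS_PER_SEC;
    }

    printf("group open/close: %.3f s, dataset open+read: %.3f s\n",
           t_group, t_read);
    H5Fclose(file);
    return 0;
}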

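P.P.S. As an example of the kind of write-time knob I had in mind -- and
this is a guess, not something we've measured -- with the 1.8 API one
can request the newer file format and hint the expected link count per
group at creation time. The file name, group name, and the estimate of
16 characters per link name below are made up for illustration:

#include "hdf5.h"

int main(void)
{
    /* Ask for the 1.8 file format so groups can use the newer
       compact/dense link storage rather than the old symbol-table
       B-tree plus local heap. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_libver_bounds(fapl, H5F_LIBVER_LATEST, H5F_LIBVER_LATEST);
    hid_t file = H5Fcreate("newstyle.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* Since we know each grid will hold ~40 datasets, hint the
       estimated link count and average link-name length. */
    hid_t gcpl = H5Pcreate(H5P_GROUP_CREATE);
    H5Pset_est_link_info(gcpl, 40, 16);

    hid_t grid = H5Gcreate2(file, "/Grid00000001", H5P_DEFAULT, gcpl,
                            H5P_DEFAULT);

    H5Gclose(grid);
    H5Pclose(gcpl);
    H5Pclose(fapl);
    H5Fclose(file);
    return 0;
}

No idea whether either of those would actually matter for our read
times, but that's the sort of parameter I was asking about.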