Hi Matt,

On Nov 23, 2011, at 11:08 AM, Matthew Turk wrote:
> Hi Quincey,
>
> On Wed, Nov 23, 2011 at 11:22 AM, Quincey Koziol <[email protected]> wrote:
>> Hi Matt,
>>
>> On Nov 22, 2011, at 1:11 PM, Matthew Turk wrote:
>>
>>> Hi all,
>>>
>>> I'm a long-time user of HDF5 (mostly via the Enzo project), but new to
>>> optimizing and attempting to really take advantage of the features of
>>> the library. We're seeing substantial performance problems at present,
>>> and we're attempting to narrow them down. As a bit of background, the
>>> data in our files is structured such that we have top-level groups (of
>>> the format /Grid00000001, /Grid00000002, etc.), and off of each group
>>> hang a fixed number of datasets; the files themselves are write-once
>>> read-many. We're in the position that we know in advance exactly how
>>> many groups we have, how many datasets hang off of each group (or at
>>> least a reasonable upper bound), and all of our data is streamed out
>>> exactly, with no chunking.
>>>
>>> What we've found lately is that about 30% of the time to read a
>>> dataset occurs just in opening the individual grids. The remainder is
>>> the actual calls to read the data. My naive guess at the source of
>>> this behavior is that opening the groups involves reading a
>>> potentially distributed index. Because of our particular situation --
>>> a fixed number of groups and datasets and inviolate data-on-disk -- is
>>> there a particular mechanism or parameter we could set by which we
>>> could speed up access to the groups and datasets?
>>
>> There's no distributed index, really; each group just has a heap with
>> the link info in it and a B-tree that indexes them. How large are the
>> files you are accessing? Are you using serial or parallel access to
>> them? What system/file system are you using?
>
> Thanks for your reply! Here's some more detailed info about the files.
> The tests have been conducted on a couple of file systems -- local disk
> (ext4), NFS, and Lustre. The files themselves are about ~150-250 MB,
> but we can often see much larger files. The files have on the order of
> 200-300 groups (grids) per file, each of which has ~40 datasets (all of
> which share roughly the same set of names). The datasets themselves are
> somewhat small -- we have both 3D and 1D datasets, and the 3D datasets
> on average contain ~10000 elements. All access (read and write) is done
> in serial to a single file.
>
> I suppose my question was ill-posed; what I was wondering is whether
> there might be any way to speed up group opening. The numbers I quoted
> earlier, of about 30% for grid opening, are a medium case. In some
> cases (on local disk, running over the same file 100 times and
> averaging) we actually see that opening and closing the groups takes
> about 50-60% of the time that opening the groups and reading the data
> takes. I suppose a broader question is, should I be surprised at this?
> Or is this to be expected, based on how HDF5 operates (and all of the
> utility it provides)? Are there low-hanging fruit that I should be
> addressing with the data handling?

I would think that this should be faster... Do you have a test program
and file(s) that I could profile with?

	Quincey
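
P.S. In case it helps while you put files together, here is roughly the
shape of standalone timing program I had in mind -- a minimal sketch
only. The grid count, the /Grid%08d naming, and the "Density" dataset
name are assumptions based on your description, so adjust them to match
your files (build with h5cc):

/* grid_open_timing.c -- time H5Gopen2/H5Gclose separately from the
 * dataset open + H5Dread, over every grid group in one file. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include "hdf5.h"

#define NUM_GRIDS 200            /* assumed; set to the real grid count */
#define DSET_NAME "Density"      /* assumed; any per-grid dataset name  */

static double now(void)          /* wall-clock seconds */
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s file.h5\n", argv[0]); return 1; }

    hid_t file = H5Fopen(argv[1], H5F_ACC_RDONLY, H5P_DEFAULT);
    if (file < 0) { fprintf(stderr, "cannot open %s\n", argv[1]); return 1; }

    double t_group = 0.0, t_read = 0.0;

    for (int i = 1; i <= NUM_GRIDS; i++) {
        char gname[32];
        snprintf(gname, sizeof(gname), "/Grid%08d", i);

        /* Time just the group open... */
        double t0 = now();
        hid_t grp = H5Gopen2(file, gname, H5P_DEFAULT);
        t_group += now() - t0;
        if (grp < 0) continue;   /* fewer grids than NUM_GRIDS */

        /* ...and the dataset open + read separately. */
        t0 = now();
        hid_t dset = H5Dopen2(grp, DSET_NAME, H5P_DEFAULT);
        if (dset >= 0) {
            hid_t space = H5Dget_space(dset);
            hssize_t n = H5Sget_simple_extent_npoints(space);
            double *buf = malloc((size_t)n * sizeof(double));
            if (buf != NULL)
                H5Dread(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL,
                        H5P_DEFAULT, buf);
            free(buf);
            H5Sclose(space);
            H5Dclose(dset);
        }
        t_read += now() - t0;

        /* Count the group close with the group side of the ledger. */
        t0 = now();
        H5Gclose(grp);
        t_group += now() - t0;
    }

    H5Fclose(file);
    printf("group open/close: %.3f s, dataset open+read: %.3f s\n",
           t_group, t_read);
    return 0;
}

Separating the group open/close timing from the dataset open/read like
this should make it clear whether the group traversal really is where
your 30-60% is going.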
