Hi Balint,
Zaak's suggestion is the correct way to allocate all the chunks for the
dataset at creation time. However, I'm more concerned about the version of the
HDF5 library you are using - we are currently at release 1.8.8 and there have
been _many_ performance improvements in the chunked dataset I/O code since
1.6.5 (BTW, the final release of the 1.6.x branch was 1.6.10). I would suggest
writing a short C program to benchmark your access pattern and then test it
against both the latest 1.6.x release and the 1.8.x release. If there is still
a performance problem, we can look into it.
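
Something along these lines would do as a starting point (a rough sketch only: the
file name, dataset shape, and chunk shape are placeholders for your real values, and
it uses the 1.6-style API so it builds against 1.6.x as-is and against 1.8.x with
-DH5_USE_16_API):

---
/* Rough benchmark sketch: write a chunked 3-D dataset one slice at a time
 * (fixed index in the third dimension), with all chunks allocated up front.
 * The shape, chunk size, and file name below are placeholders. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "hdf5.h"

int main(void)
{
    hsize_t dims[3]  = {64, 1024, 512};          /* placeholder dataset shape      */
    hsize_t chunk[3] = {1, 1024, 1};             /* chunked along second dimension */
    float  *slice    = calloc((size_t)dims[0] * dims[1], sizeof(float));

    hid_t file  = H5Fcreate("bench.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(3, dims, NULL);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 3, chunk);
    H5Pset_alloc_time(dcpl, H5D_ALLOC_TIME_EARLY);   /* allocate all chunks at create time */
    hid_t dset  = H5Dcreate(file, "data", H5T_NATIVE_FLOAT, space, dcpl);

    hsize_t start[3] = {0, 0, 0}, count[3] = {dims[0], dims[1], 1};
    hid_t mspace = H5Screate_simple(3, count, NULL);

    time_t t0 = time(NULL);                          /* coarse wall-clock timing */
    for (hsize_t k = 0; k < dims[2]; k++) {
        start[2] = k;                                /* one slice at a fixed third-dim index */
        H5Sselect_hyperslab(space, H5S_SELECT_SET, start, NULL, count, NULL);
        H5Dwrite(dset, H5T_NATIVE_FLOAT, mspace, space, H5P_DEFAULT, slice);
    }
    printf("wrote %llu slices in %ld s\n",
           (unsigned long long)dims[2], (long)(time(NULL) - t0));

    H5Sclose(mspace); H5Dclose(dset); H5Pclose(dcpl);
    H5Sclose(space); H5Fclose(file); free(slice);
    return 0;
}
---
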
Quincey
On Dec 13, 2011, at 1:01 PM, Zaak Beekman wrote:
> Balint,
> I am not sure whether pre-allocation will help performance, but there is a
> good chance it will: the default for chunked datasets is to allocate space
> incrementally (chunk by chunk) as data is written to the dataset, which can
> be costly when the chunks are small and there are a lot of them. If MATLAB
> exposes the low-level HDF5 APIs (which I believe it does), you can create a
> dataset creation property list and call H5Pset_alloc_time on it, passing
> H5D_ALLOC_TIME_EARLY as the alloc_time argument. There should be no need to
> mess with the fill value or do any filling as far as I can tell.
>
> You will need to create a property list first, set this property on it, and
> then pass it to H5Dcreate. Also, I think MATLAB splits the HDF5 API into
> classes, so the function might look like H5P.set_alloc_time or something
> like that. It might also be worthwhile to check that your MATLAB is a recent
> version, so that it is compiled/linked against a recent HDF5 release.
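>
> In the C API the sequence would look roughly like this (just a sketch -- the
> rank, dimensions, and chunk sizes are placeholders for your real values, and
> file_id stands for an already-open file identifier; the 5-argument H5Dcreate
> is the 1.6-style call matching the library MATLAB ships with):
>
> ---
> hsize_t dims[3]  = {64, 1024, 512};                 /* placeholder dataset shape      */
> hsize_t chunk[3] = {1, 1024, 1};                    /* chunked along second dimension */
>
> hid_t space = H5Screate_simple(3, dims, NULL);
> hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);        /* dataset creation property list */
> H5Pset_chunk(dcpl, 3, chunk);
> H5Pset_alloc_time(dcpl, H5D_ALLOC_TIME_EARLY);      /* allocate every chunk at create time */
> hid_t dset  = H5Dcreate(file_id, "data", H5T_NATIVE_FLOAT, space, dcpl);
> ---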
>
> Documentation for H5Pset_alloc_time may be found here:
> http://www.hdfgroup.org/HDF5/doc/RM/RM_H5P.html#Property-SetAllocTime
>
> Good luck,
> Izaak Beekman
> ===================================
> (301)244-9367
> Princeton University Doctoral Candidate
> Mechanical and Aerospace Engineering
> [email protected]
>
> UMD-CP Visiting Graduate Student
> Aerospace Engineering
> [email protected]
> [email protected]
>
>
> On Tue, Dec 13, 2011 at 12:00 PM, <[email protected]> wrote:
>
> Today's Topics:
>
> 1. Datasets keep old names after parent has been renamed?
> (Darren Dale)
> 2. speeding up write of chunked HDF (Balint Takacs)
>
>
> ---------- Forwarded message ----------
> From: Darren Dale <[email protected]>
> To: HDF Users Discussion List <[email protected]>
> Cc:
> Date: Mon, 12 Dec 2011 12:29:18 -0500
> Subject: [Hdf-forum] Datasets keep old names after parent has been renamed?
> (Apologies if this gets posted twice)
>
> Someone reported a bug at the h5py issue tracker:
>
> ---
> import h5py
>
> # test setup
> fid = h5py.File('test.hdf5', 'w')
>
> g = fid.create_group('old_loc')
> g2 = g.create_group('group')
> d = g.create_dataset('dataset', data=0)
>
> print "before move:"
> print g2.name
> print d.name
>
> # now rename toplevel group
> g.parent.id.move('old_loc', 'new_loc')
>
> print "after move:"
> # old parent remains in dataset name, group is ok
> print g2.name
> print d.name
>
> # parent is looked up by the old path, which no longer exists
> d.parent
>
> fid.close()
> ---
>
> That script produces the following output:
>
> ---
> before move:
> /old_loc/group
> /old_loc/dataset
> after move:
> /new_loc/group
> /old_loc/dataset
> Traceback (most recent call last):
> File "move_error.py", line 24, in <module>
> d.parent
> File
> "/Users/darren/Library/Python/2.7/lib/python/site-packages/h5py/_hl/base.py",
> line 144, in parent
> return self.file[posixpath.dirname(self.name)]
> File
> "/Users/darren/Library/Python/2.7/lib/python/site-packages/h5py/_hl/group.py",
> line 128, in __getitem__
> oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
> File "h5o.pyx", line 176, in h5py.h5o.open (h5py/h5o.c:2814)
> KeyError: "unable to open object (Symbol table: Can't open object)"
> ---
>
> g.name and d.name simply return the result of h5i.get_name.
>
> d.parent just splits d.name at the last "/" and returns the first
> part of the split.
>
> g.parent.id.move calls H5Gmove2. I've read the warnings about
> corrupting data using H5Gmove at
> http://www.hdfgroup.org/HDF5/doc1.6/Groups.html#H5GUnlinkToCorrupt ,
> but the situation described there does not appear to be relevant to
> the problem we are seeing. Is h5py not performing the move properly,
> or could this be a bug in HDF5?
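>
> For reference, a minimal C program along the following lines should show
> whether H5Iget_name itself reports the stale path after the move, which would
> point at the library rather than h5py (the file and object names here are
> placeholders; 1.6-style API calls, so build against 1.8 with -DH5_USE_16_API
> or switch to the *2 variants):
>
> ---
> /* Sketch: does H5Iget_name report the stale path after H5Gmove? */
> #include <stdio.h>
> #include "hdf5.h"
>
> int main(void)
> {
>     char name[256];
>     hid_t file = H5Fcreate("move_test.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
>     hid_t grp  = H5Gcreate(file, "old_loc", 0);
>     hid_t spc  = H5Screate(H5S_SCALAR);
>     hid_t dset = H5Dcreate(grp, "dataset", H5T_NATIVE_INT, spc, H5P_DEFAULT);
>
>     H5Gmove(file, "old_loc", "new_loc");       /* rename the parent group */
>
>     H5Iget_name(dset, name, sizeof(name));     /* path reported for the still-open dataset */
>     printf("dataset name after move: %s\n", name);
>
>     H5Dclose(dset); H5Sclose(spc); H5Gclose(grp); H5Fclose(file);
>     return 0;
> }
> ---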
>
> Thanks,
> Darren
>
>
>
>
> ---------- Forwarded message ----------
> From: Balint Takacs <[email protected]>
> To: [email protected]
> Cc:
> Date: Tue, 13 Dec 2011 12:37:08 +0000
> Subject: [Hdf-forum] speeding up write of chunked HDF
> Hi all,
>
> I need to fill a huge 3D array, chunked along its second dimension. My data
> arrive as slices with a fixed index in the third dimension, so the layout
> needs to be re-ordered on write. The chunks are uncompressed. When the data
> is read, the access pattern sweeps through the second dimension, so the
> chunking layout makes sense. The data is stored on an SSD, so random access
> should be relatively fast. I cannot change the index order of the data.
>
> In theory, when filling up the array, the data could be written sequentially
> if it were stored in a raw file. With HDF5, however, this becomes painfully
> slow. The only way I have found to speed it up somewhat is to read as many
> slices as I can into memory and then write them together in batches, but I
> still see write rates of less than 2 MB/s on average.
>
> The file grows gradually as the slices are added. If this expansion requires
> re-ordering the existing data, that could explain the slow write speed. I was
> wondering whether pre-allocating the entire file somehow could help with
> this, and what the best way to do that is. I could not find any related API
> function. I know the total data size before the data collection starts.
>
> The only idea I have so far is to fill the array with some dummy value (not
> the fill value) by sweeping through the chunking dimension before adding the
> slices. This would probably grow the file to its final size quickly, but I am
> not sure it would help at all, and it is definitely ugly.
>
> I am using MATLAB 2007a with the HDF5 1.6.5 library it ships with.
>
> Thank you in advance for your comments.
>
> Regards,
>
> Balint
>
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org