Hi Balint,
Zaak's suggestion is the correct way to allocate all the chunks for the
dataset at creation time. However, I'm more concerned about the version of the
HDF5 library you are using - we are currently at release 1.8.8 and there have
been _many_ performance improvements in the chunked dataset I/O code since
1.6.5 (BTW, the final release of the 1.6.x branch was 1.6.10). I would suggest
writing a short C program to benchmark your access pattern and then test it
against both the latest 1.6.x release and the 1.8.x release. If there is still
a performance problem, we can look into it.
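
Something along these lines would do as a starting point (a rough sketch only: the
file name, dataset shape, and chunk shape are placeholders for your real values, and
it uses the 1.6-style API so it builds against 1.6.x as-is and against 1.8.x with
-DH5_USE_16_API):

---
/* Rough benchmark sketch: write a chunked 3-D dataset one slice at a time
 * (fixed index in the third dimension), with all chunks allocated up front.
 * The shape, chunk size, and file name below are placeholders. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "hdf5.h"

int main(void)
{
    hsize_t dims[3]  = {64, 1024, 512};          /* placeholder dataset shape      */
    hsize_t chunk[3] = {1, 1024, 1};             /* chunked along second dimension */
    float  *slice    = calloc((size_t)dims[0] * dims[1], sizeof(float));

    hid_t file  = H5Fcreate("bench.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(3, dims, NULL);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 3, chunk);
    H5Pset_alloc_time(dcpl, H5D_ALLOC_TIME_EARLY);   /* allocate all chunks at create time */
    hid_t dset  = H5Dcreate(file, "data", H5T_NATIVE_FLOAT, space, dcpl);

    hsize_t start[3] = {0, 0, 0}, count[3] = {dims[0], dims[1], 1};
    hid_t mspace = H5Screate_simple(3, count, NULL);

    time_t t0 = time(NULL);                          /* coarse wall-clock timing */
    for (hsize_t k = 0; k < dims[2]; k++) {
        start[2] = k;                                /* one slice at a fixed third-dim index */
        H5Sselect_hyperslab(space, H5S_SELECT_SET, start, NULL, count, NULL);
        H5Dwrite(dset, H5T_NATIVE_FLOAT, mspace, space, H5P_DEFAULT, slice);
    }
    printf("wrote %llu slices in %ld s\n",
           (unsigned long long)dims[2], (long)(time(NULL) - t0));

    H5Sclose(mspace); H5Dclose(dset); H5Pclose(dcpl);
    H5Sclose(space); H5Fclose(file); free(slice);
    return 0;
}
---
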
Quincey
On Dec 13, 2011, at 1:01 PM, Zaak Beekman wrote:
> Balint,
> I am not sure whether pre-allocation will help performance, but there is a
> good chance it will: the default for chunked datasets is to allocate space
> incrementally (chunk by chunk) as data is written to the dataset, which can
> be costly when the chunks are small and there are a lot of them. If MATLAB
> exposes the low-level HDF5 APIs (which I believe it does), you can create a
> dataset creation property list and call H5Pset_alloc_time on it, passing
> H5D_ALLOC_TIME_EARLY as the alloc_time argument. There should be no need to
> mess with the fill value or do any filling as far as I can tell.
>
> You will need to create a property list first, set this property on it, and
> then pass it to H5Dcreate. Also, I think MATLAB splits the HDF5 API into
> classes, so the function might look like H5P.set_alloc_time or something
> like that. It might also be worthwhile to check that your MATLAB is a recent
> version, so that it is compiled/linked against a recent HDF5 release.
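>
> In the C API the sequence would look roughly like this (just a sketch -- the
> rank, dimensions, and chunk sizes are placeholders for your real values, and
> file_id stands for an already-open file identifier; the 5-argument H5Dcreate
> is the 1.6-style call matching the library MATLAB ships with):
>
> ---
> hsize_t dims[3]  = {64, 1024, 512};                 /* placeholder dataset shape      */
> hsize_t chunk[3] = {1, 1024, 1};                    /* chunked along second dimension */
>
> hid_t space = H5Screate_simple(3, dims, NULL);
> hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);        /* dataset creation property list */
> H5Pset_chunk(dcpl, 3, chunk);
> H5Pset_alloc_time(dcpl, H5D_ALLOC_TIME_EARLY);      /* allocate every chunk at create time */
> hid_t dset  = H5Dcreate(file_id, "data", H5T_NATIVE_FLOAT, space, dcpl);
> ---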
>
> Documentation for H5Pset_alloc_time may be found here:
> http://www.hdfgroup.org/HDF5/doc/RM/RM_H5P.html#Property-SetAllocTime
>
> Good luck,
> Izaak Beekman
> ===================================
> (301)244-9367
> Princeton University Doctoral Candidate
> Mechanical and Aerospace Engineering
> [email protected]
>
> UMD-CP Visiting Graduate Student
> Aerospace Engineering
> [email protected]
> [email protected]
>
>
> On Tue, Dec 13, 2011 at 12:00 PM, <[email protected]> wrote:
>
> Today's Topics:
>
> 1. Datasets keep old names after parent has been renamed?
> (Darren Dale)
> 2. speeding up write of chunked HDF (Balint Takacs)
>
>
> ---------- Forwarded message ----------
> From: Darren Dale <[email protected]>
> To: HDF Users Discussion List <[email protected]>
> Cc:
> Date: Mon, 12 Dec 2011 12:29:18 -0500
> Subject: [Hdf-forum] Datasets keep old names after parent has been renamed?
> (Apologies if this gets posted twice)
>
> Someone reported a bug at the h5py issue tracker:
>
> ---
> import h5py
>
> # test setup
> fid = h5py.File('test.hdf5', 'w')
>
> g = fid.create_group('old_loc')
> g2 = g.create_group('group')
> d = g.create_dataset('dataset', data=0)
>
> print "before move:"
> print g2.name
> print d.name
>
> # now rename toplevel group
> g.parent.id.move('old_loc', 'new_loc')
>
> print "after move:"
> # old parent remains in dataset name, group is ok
> print g2.name
> print d.name
>
> # parent is looked up by the old path, which no longer exists
> d.parent
>
> fid.close()
> ---
>
> That script produces the following output:
>
> ---
> before move:
> /old_loc/group
> /old_loc/dataset
> after move:
> /new_loc/group
> /old_loc/dataset
> Traceback (most recent call last):
> File "move_error.py", line 24, in <module>
> d.parent
> File
> "/Users/darren/Library/Python/2.7/lib/python/site-packages/h5py/_hl/base.py",
> line 144, in parent
> return self.file[posixpath.dirname(self.name)]
> File
> "/Users/darren/Library/Python/2.7/lib/python/site-packages/h5py/_hl/group.py",
> line 128, in __getitem__
> oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
> File "h5o.pyx", line 176, in h5py.h5o.open (h5py/h5o.c:2814)
> KeyError: "unable to open object (Symbol table: Can't open object)"
> ---
>
> g.name and d.name simply return the result of h5i.get_name.
>
> d.parent just splits d.name at the last "/" and returns the first
> part of the split.
>
> g.parent.id.move calls H5Gmove2. I've read the warnings about
> corrupting data using H5Gmove at
> http://www.hdfgroup.org/HDF5/doc1.6/Groups.html#H5GUnlinkToCorrupt ,
> but the situation described there does not appear to be relevant to
> the problem we are seeing. Is h5py not performing the move properly,
> or could this be a bug in HDF5?
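>
> For reference, a minimal C program along the following lines should show
> whether H5Iget_name itself reports the stale path after the move, which would
> point at the library rather than h5py (the file and object names here are
> placeholders; 1.6-style API calls, so build against 1.8 with -DH5_USE_16_API
> or switch to the *2 variants):
>
> ---
> /* Sketch: does H5Iget_name report the stale path after H5Gmove? */
> #include <stdio.h>
> #include "hdf5.h"
>
> int main(void)
> {
>     char name[256];
>     hid_t file = H5Fcreate("move_test.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
>     hid_t grp  = H5Gcreate(file, "old_loc", 0);
>     hid_t spc  = H5Screate(H5S_SCALAR);
>     hid_t dset = H5Dcreate(grp, "dataset", H5T_NATIVE_INT, spc, H5P_DEFAULT);
>
>     H5Gmove(file, "old_loc", "new_loc");       /* rename the parent group */
>
>     H5Iget_name(dset, name, sizeof(name));     /* path reported for the still-open dataset */
>     printf("dataset name after move: %s\n", name);
>
>     H5Dclose(dset); H5Sclose(spc); H5Gclose(grp); H5Fclose(file);
>     return 0;
> }
> ---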
>
> Thanks,
> Darren
>
>
>
>
> ---------- Forwarded message ----------
> From: Balint Takacs <[email protected]>
> To: [email protected]
> Cc:
> Date: Tue, 13 Dec 2011 12:37:08 +0000
> Subject: [Hdf-forum] speeding up write of chunked HDF
> Hi all,
>
> I need to fill a huge 3D array, chunked along its second dimension. My data
> arrive as slices with a fixed index in the third dimension, so the layout
> needs to be re-ordered on write. The chunks are uncompressed. When the data
> is read, the access pattern sweeps through the second dimension, so the
> chunking layout makes sense. The data is stored on an SSD, so random access
> should be relatively fast. I cannot change the index order of the data.
>
> In theory, when filling up the array, the data could be written sequentially
> if it were stored in a raw file. With HDF5, however, this becomes painfully
> slow. The only way I have found to speed it up somewhat is to read as many
> slices as I can into memory and then write them together in batches, but I
> still see write rates of less than 2 MB/s on average.
>
> The file grows gradually as the slices are added. If this expansion requires
> re-ordering the existing data, that could explain the slow write speed. I was
> wondering whether pre-allocating the entire file somehow could help with
> this, and what the best way to do that is. I could not find any related API
> function. I know the total data size before the data collection starts.
>
> The only idea I have so far is to fill the array with some dummy value (not
> the fill value) by sweeping through the chunking dimension before adding the
> slices. This would probably grow the file to its final size quickly, but I am
> not sure it would help at all, and it is definitely ugly.
>
> I am using MATLAB 2007a with the HDF5 1.6.5 library it ships with.
>
> Thank you in advance for your comments.
>
> Regards,
>
> Balint
>
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org