Hi Balint,
        Zaak's suggestion is the correct way to allocate all the chunks for the 
dataset at creation time.  However, I'm more concerned about the version of the 
HDF5 library you are using - we are currently at release 1.8.8 and there have 
been _many_ performance improvements in the chunked dataset I/O code since 
1.6.5 (BTW, the final release of the 1.6.x branch was 1.6.10).  I would suggest 
writing a short C program that exercises your access pattern, then benchmarking 
it against both the latest 1.6.x release and the latest 1.8.x release.  If there 
is still a performance problem, we can look into it.
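
For reference, here is a rough, untested sketch of the kind of benchmark I have 
in mind.  The extents, chunk shape and dataset name are invented, so adjust them 
to match your real data, and build with -DH5_USE_16_API when compiling against a 
1.8.x library so the 1.6-style calls below still resolve:

---
/* chunk_bench.c - sketch only, error checking omitted.
 * Creates an NX x NY x NZ dataset chunked along the second dimension,
 * pre-allocates every chunk, then writes it one z-slice at a time and
 * reports the elapsed wall-clock time.
 *
 *   cc chunk_bench.c -lhdf5 -o chunk_bench
 */
#include <stdio.h>
#include <time.h>
#include "hdf5.h"

#define NX 64
#define NY 1024
#define NZ 256

int main(void)
{
    hsize_t dims[3]  = {NX, NY, NZ};
    hsize_t chunk[3] = {1, NY, 1};       /* each chunk spans the 2nd dim */
    hsize_t slice[3] = {NX, NY, 1};      /* one z-slice per write        */
    hsize_t start[3] = {0, 0, 0};
    static float buf[NX][NY];            /* dummy slice data             */
    hid_t   file, space, memspace, dcpl, dset;
    hsize_t k;
    time_t  t0;

    file  = H5Fcreate("bench.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    space = H5Screate_simple(3, dims, NULL);

    dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 3, chunk);
    H5Pset_alloc_time(dcpl, H5D_ALLOC_TIME_EARLY);  /* allocate all chunks up front */

    dset     = H5Dcreate(file, "data", H5T_NATIVE_FLOAT, space, dcpl);
    memspace = H5Screate_simple(3, slice, NULL);

    t0 = time(NULL);
    for (k = 0; k < NZ; k++) {
        start[2] = k;                    /* fixed index in the 3rd dim */
        H5Sselect_hyperslab(space, H5S_SELECT_SET, start, NULL, slice, NULL);
        H5Dwrite(dset, H5T_NATIVE_FLOAT, memspace, space, H5P_DEFAULT, buf);
    }
    printf("wrote %d slices in %.0f s\n", NZ, difftime(time(NULL), t0));

    H5Sclose(memspace);
    H5Dclose(dset);
    H5Pclose(dcpl);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}
---

Compile and run it once against each library version and compare the timings.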

        Quincey

On Dec 13, 2011, at 1:01 PM, Zaak Beekman wrote:

> Balint,
> I am not sure whether pre-allocation will help performance, but there is a 
> good chance it will: the default for chunked datasets is to allocate space 
> incrementally (chunk by chunk) as data is written to the dataset, which is 
> especially costly when the chunks are small and there are a lot of them. If 
> MATLAB has access to the low-level HDF5 API (which I believe it does), you 
> can call H5Pset_alloc_time with H5D_ALLOC_TIME_EARLY on a dataset creation 
> property list. There should be no need to mess with the fill value or do any 
> filling as far as I can tell. 
> 
> You will need to create a property list first, set this property on it, and 
> then pass it to H5Dcreate. Also, I think MATLAB splits the HDF5 API into 
> classes, so the function might look like H5P.set_alloc_time or something like 
> that. It might also be worthwhile to check that your MATLAB is a recent 
> version, so that it is compiled/linked against a recent HDF5 release.
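> 
> For illustration, a minimal, untested C sketch of that sequence might look 
> like the lines below (the MATLAB H5P/H5D wrappers, where available, mirror 
> these calls one for one; the dataset name, extents and chunk shape are made 
> up):
> 
> ---
> #include "hdf5.h"
> 
> /* Create a chunked dataset whose chunks are all allocated up front.
>  * Error checking omitted; uses the 1.6-style H5Dcreate signature
>  * (compile with -DH5_USE_16_API against a 1.8.x library). */
> hid_t create_preallocated_dataset(hid_t file_id)
> {
>     hsize_t dims[3]  = {100, 2000, 500};   /* made-up extents         */
>     hsize_t chunk[3] = {1, 2000, 1};       /* chunks span the 2nd dim */
>     hid_t   space, dcpl, dset;
> 
>     space = H5Screate_simple(3, dims, NULL);
>     dcpl  = H5Pcreate(H5P_DATASET_CREATE);
>     H5Pset_chunk(dcpl, 3, chunk);
>     /* allocate every chunk at creation time instead of incrementally */
>     H5Pset_alloc_time(dcpl, H5D_ALLOC_TIME_EARLY);
> 
>     dset = H5Dcreate(file_id, "/data", H5T_NATIVE_DOUBLE, space, dcpl);
> 
>     H5Pclose(dcpl);
>     H5Sclose(space);
>     return dset;                           /* caller closes with H5Dclose */
> }
> ---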
> 
> Documentation for H5Pset_alloc_time may be found here: 
> http://www.hdfgroup.org/HDF5/doc/RM/RM_H5P.html#Property-SetAllocTime
> 
> Good luck,
> Izaak Beekman
> ===================================
> (301)244-9367
> Princeton University Doctoral Candidate
> Mechanical and Aerospace Engineering
> [email protected]
> 
> UMD-CP Visiting Graduate Student
> Aerospace Engineering
> [email protected]
> [email protected]
> 
> 
> On Tue, Dec 13, 2011 at 12:00 PM, <[email protected]> wrote:
> 
> Today's Topics:
> 
>   1. Datasets keep old names after parent has been renamed?
>      (Darren Dale)
>   2. speeding up write of chunked HDF (Balint Takacs)
> 
> 
> ---------- Forwarded message ----------
> From: Darren Dale <[email protected]>
> To: HDF Users Discussion List <[email protected]>
> Cc: 
> Date: Mon, 12 Dec 2011 12:29:18 -0500
> Subject: [Hdf-forum] Datasets keep old names after parent has been renamed?
> (Apologies if this gets posted twice)
> 
> Someone reported a bug at the h5py issue tracker:
> 
> ---
> import h5py
> 
> # test setup
> fid = h5py.File('test.hdf5', 'w')
> 
> g = fid.create_group('old_loc')
> g2 = g.create_group('group')
> d = g.create_dataset('dataset', data=0)
> 
> print "before move:"
> print g2.name
> print d.name
> 
> # now rename toplevel group
> g.parent.id.move('old_loc', 'new_loc')
> 
> print "after move:"
> # old parent remains in dataset name, group is ok
> print g2.name
> print d.name
> 
> # d.parent is derived from d.name, whose old path no longer exists, so this fails
> d.parent
> 
> fid.close()
> ---
> 
> That script produces the following output:
> 
> ---
> before move:
> /old_loc/group
> /old_loc/dataset
> after move:
> /new_loc/group
> /old_loc/dataset
> Traceback (most recent call last):
>  File "move_error.py", line 24, in <module>
>   d.parent
>  File "/Users/darren/Library/Python/2.7/lib/python/site-packages/h5py/_hl/base.py", line 144, in parent
>   return self.file[posixpath.dirname(self.name)]
>  File "/Users/darren/Library/Python/2.7/lib/python/site-packages/h5py/_hl/group.py", line 128, in __getitem__
>   oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
>  File "h5o.pyx", line 176, in h5py.h5o.open (h5py/h5o.c:2814)
> KeyError: "unable to open object (Symbol table: Can't open object)"
> ---
> 
> g.name and d.name simply return the result of h5i.get_name.
> 
> d.parent just splits d.name at the last "/" and returns the first part of 
> the split.
> 
> g.parent.id.move calls H5Gmove2. I've read the warnings about
> corrupting data using H5Gmove at
> http://www.hdfgroup.org/HDF5/doc1.6/Groups.html#H5GUnlinkToCorrupt ,
> but the situation described there does not appear to be relevant to
> the problem we are seeing. Is h5py not performing the move properly,
> or could this be a bug in HDF5?
> 
> Thanks,
> Darren
> 
> 
> 
> 
> ---------- Forwarded message ----------
> From: Balint Takacs <[email protected]>
> To: [email protected]
> Cc: 
> Date: Tue, 13 Dec 2011 12:37:08 +0000
> Subject: [Hdf-forum] speeding up write of chunked HDF
> Hi all,
> 
> I need to fill a huge 3D array, chunked in its second dimension. My data 
> arrive as slices with a fixed index in the third dimension, so the layout 
> needs to be re-ordered as it is written. The chunks are uncompressed. When 
> the data is read, the access pattern sweeps through the second dimension, so 
> the chunking layout makes sense. The data is stored on an SSD, so random 
> access should be relatively fast. I cannot change the index order of the 
> data.
> 
> In theory, when filling up the array, the data could be written sequentially 
> if it were stored in a raw file. However, with HDF5 this becomes painfully 
> slow. The only way I have found to speed it up somewhat is to read as many 
> slices as I can into memory and then write them out in batches, but I still 
> see write rates of under 2 MB/sec on average.
> 
> The file grows gradually as the slices are added. If this expansion requires 
> re-organizing the data already written, it could explain the slow write 
> speed. I was wondering whether pre-allocating the entire file could somehow 
> help with this, and what the best way to do it is. I could not find any 
> related API function. I know the total data size before the data collection 
> starts.
> 
> The only idea I have so far is to fill the array with some dummy value 
> (other than the fill value) by sweeping through the chunking dimension before 
> adding the slices. This would probably grow the file to its final size 
> quickly, but I am not sure it would help at all, and it is definitely ugly.
> 
> I am using MATLAB 2007a with the HDF5 1.6.5 library it ships with.
> 
> Thank you in advance for your comments.
> 
> Regards,
> 
> Balint
> 

_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org
