Thanks, Dan. This was very helpful.

Ryan
On Fri, Aug 19, 2011 at 9:40 AM, Daniel Kahn <[email protected]> wrote:

> Ryan,
>
> Ryan Price wrote:
>
>> I have a 4GB binary dump of data that I'd like to store as an HDF5
>> dataset (using command line tools if possible). The dataset will have
>> the dimensions 31486448 x 128. I believe this is too big to import as
>> a dataset in one go.
>
> 4GB is a large array. You may wish to give some thought to how the
> data will be used after you have created the file. Will the end user
> really process all 4GB at once? HDF5 provides chunking and compression
> functionality which (transparently to the data reader, and almost
> transparently to the writer) will store the data in "chunks" and
> compress them as well, if you'd like. If you can make the chunk size
> close to the amount of data the end user will want to access, it can
> be very convenient.
>
> Here is a piece of Python I wrote to demonstrate creating a file with
> chunking and compression. I was able to open the file and view the
> dataset properties in HDFView, but not the dataset itself, because the
> array is so large. You can use this code if the major axis of your
> binary dump is in the 128 direction. If it is in the other direction,
> you'll probably want to choose different chunking parameters and read
> the binary data off disk appropriately. (By the way, I get a 3.8MB
> file, since the array contains a single value and I've turned on
> compression.)
>
> import numpy
> import h5py
>
> with h5py.File('BigArray.h5', 'w') as fid:
>     # Chunked, gzip-compressed dataset: one chunk per column of 128.
>     dset = fid.create_dataset('BigArray', shape=(31486448, 128),
>                               dtype='int8', chunks=(31486448, 1),
>                               compression='gzip')
>     # 1-D array of length 31486448, standing in for one column's data.
>     slicearray = numpy.ones([31486448], dtype='int8')
>     for i in range(128):
>         # Replace this comment with a read from the binary file
>         # into slicearray.
>         print("Writing", i)
>         dset[:, i] = slicearray  # populate slice i of the HDF5 dataset
>
> Cheers,
> --dan
>
>> Running h5import gives the following error:
>>
>>     Unable to allocate dynamic memory.
>>     Error in allocating unsigned integer data storage.
>>     Error in reading the input file: my_data
>>     Program aborted.
>>
>> So I split the binary dump into four files which can be imported. I'd
>> still like to have one 31486448 x 128 dataset but am not sure that's
>> possible to do.
>>
>> Any idea how I could combine these four binary dumps into one dataset?
>> Maybe create a single dataset and append each small one...?
>>
>> Thanks,
>>
>> Ryan
>
> --
> Daniel Kahn
> Science Systems and Applications Inc.
> 301-867-2162
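To make Dan's point about matching the chunk shape to the reader's access pattern concrete: with the file his script writes, pulling out a single column touches exactly one compressed chunk. A minimal read-back sketch (the column index 5 is an arbitrary example):

    import h5py

    # Open the file written by Dan's script above.
    with h5py.File('BigArray.h5', 'r') as fid:
        dset = fid['BigArray']
        # One column corresponds to one (31486448, 1) chunk, so only
        # that chunk has to be read from disk and decompressed.
        col = dset[:, 5]
        print(col.shape, dset.chunks, dset.compression)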
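On Ryan's remaining question about combining the four split dumps: no appending is needed. Since the final shape is known up front, one option is to create the full 31486448 x 128 dataset once and write each piece into its own row range (a hyperslab). A sketch along those lines, assuming the dump was split evenly along the 31486448 axis and stored row-major as int8; the part file names and the output name Combined.h5 are hypothetical:

    import numpy
    import h5py

    NROWS, NCOLS = 31486448, 128
    part_files = ['part0.bin', 'part1.bin', 'part2.bin', 'part3.bin']
    rows_per_part = NROWS // len(part_files)  # 7871612 rows, ~1GB apiece

    with h5py.File('Combined.h5', 'w') as fid:
        dset = fid.create_dataset('BigArray', shape=(NROWS, NCOLS),
                                  dtype='int8',
                                  chunks=(rows_per_part, 1),
                                  compression='gzip')
        for n, fname in enumerate(part_files):
            # Load one ~1GB piece and write it into its row range of
            # the single big dataset.
            piece = numpy.fromfile(fname, dtype='int8')
            piece = piece.reshape(rows_per_part, NCOLS)
            dset[n * rows_per_part:(n + 1) * rows_per_part, :] = piece
            print("Wrote", fname)

The chunk shape (rows_per_part, 1) is just one choice; it keeps each write aligned to whole chunks, so no compressed chunk has to be re-read and rewritten. If the final size were not known in advance, h5py can also grow a dataset created with maxshape= via dset.resize(), but preallocating is simpler here.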
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org
