You may have a look here:

http://www.hdfgroup.org/services/filters.html

On 14/01/13 12:15, Robert Seigel wrote:
Hi Elena,

Thank you for the clarification. Following on from Jerome's suggestion of installing my own compression algorithm (or library) within the routines where I call HDF5: is it possible to use H5Zregister and H5Pset_filter to define another compression type (e.g. bzip2)? Or would this still not work in parallel? If not, I suppose a possible (though inefficient) solution would be to rewrite the file with compression after it has been created, with each dataset as a single chunk, assuming there is adequate memory to hold each dataset without parallelization. Has anyone done something similar?
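
For concreteness, here is a minimal sketch of what I have in mind (the compression callback body is elided, dcpl_id is assumed to be a chunked dataset creation property list, and 307 is the bzip2 filter ID registered with The HDF Group):

#include <hdf5.h>

#define H5Z_FILTER_BZIP2 307   /* filter ID registered for bzip2 */

/* Filter callback: compresses on write, decompresses on read.
   The BZ2_bzBuffToBuffCompress/Decompress calls are elided here. */
static size_t bzip2_filter(unsigned int flags, size_t cd_nelmts,
                           const unsigned int cd_values[], size_t nbytes,
                           size_t *buf_size, void **buf)
{
    /* ... (de)compress *buf, reallocating and updating *buf_size ... */
    return nbytes;             /* bytes of output, or 0 on failure */
}

static const H5Z_class2_t bzip2_class = {
    H5Z_CLASS_T_VERS,          /* struct version                   */
    H5Z_FILTER_BZIP2,          /* filter identifier                */
    1, 1,                      /* encoder and decoder present      */
    "bzip2",                   /* comment                          */
    NULL, NULL,                /* optional can_apply / set_local   */
    bzip2_filter               /* the filter function              */
};

/* Register the filter, then attach it to a chunked dataset
   creation property list before H5Dcreate: */
static herr_t enable_bzip2(hid_t dcpl_id)
{
    const unsigned cd_values[1] = {9};   /* e.g. bzip2 block size */
    if (H5Zregister(&bzip2_class) < 0)
        return -1;
    return H5Pset_filter(dcpl_id, H5Z_FILTER_BZIP2,
                         H5Z_FLAG_MANDATORY, 1, cd_values);
}

That is the sequential route, at least; my question is whether the same property list can be used when the file is opened for MPI-IO.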

As for your bigger question, I can give you some background on atmospheric models like the one I am currently working on. Generally, these models use parallelization to break up three-dimensional grids (x, y, z) into subdomains of vertical columns, so that every processor owns its own portion of the atmosphere (the vertical coordinate is not usually subdivided), which can then be integrated in time. Every so often, the full grids need to be written out for postprocessing and analysis (or, conversely, read into the model for a history restart, etc.). This is where most atmospheric models would take an approach similar to mine: each subdomain writes its own "chunk" of the atmosphere as a hyperslab of the larger dataset. The number of datasets is usually large (my model has ~250 2D, 3D, and 4D datasets that are subdivided in x and y). For large simulations that require parallelization, each file can be tens of GB, amounting to many TB for one simulation even when compressed, so compression is essential! I hope this helps.
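
In case it helps the requirements discussion, the write pattern boils down to something like the sketch below (names such as nx_local, x_offset, and dset_id are illustrative, and error checks are omitted):

#include <hdf5.h>

/* Each rank owns an nx_local x ny_local patch of vertical columns and
   writes it as a hyperslab of the full NX x NY x NZ dataset. */
static void write_patch(hid_t dset_id, const float *local_data,
                        hsize_t x_offset, hsize_t y_offset,
                        hsize_t nx_local, hsize_t ny_local, hsize_t nz)
{
    hsize_t start[3] = {x_offset, y_offset, 0};   /* this rank's corner */
    hsize_t count[3] = {nx_local, ny_local, nz};  /* full z extent      */

    hid_t filespace = H5Dget_space(dset_id);
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);

    hid_t memspace = H5Screate_simple(3, count, NULL);

    hid_t xfer = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(xfer, H5FD_MPIO_COLLECTIVE);  /* collective I/O */

    H5Dwrite(dset_id, H5T_NATIVE_FLOAT, memspace, filespace, xfer,
             local_data);

    H5Pclose(xfer);
    H5Sclose(memspace);
    H5Sclose(filespace);
}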

Thanks again,
Rob


On Sun, Jan 13, 2013 at 7:24 PM, Elena Pourmal <[email protected]> wrote:

    Hi Robert and Jerome,

    The sequential HDF5 library can write and read compressed data. Parallel HDF5 can read a compressed dataset from several processes, but it cannot write to a compressed dataset.

    Writing compressed data in parallel is an often-requested feature, but unfortunately we do not have funding to implement it. Before (or actually after ;-) talking about funding, though, we really need to gather requirements for this feature.

    All,

    Enabling writing of compressed data in the parallel HDF5 library will require a lot of prototyping and a substantial development effort. We would like to hear from you if you think this feature is absolutely critical for your application. We would also like to learn more about the write patterns your application uses.

    In Robert's example, each process writes a chunk of an HDF5 dataset. This special case may be a little easier to address than the general case in which data from one chunk is distributed among several processes. It would be good to know whether this particular scenario is common. What other I/O patterns are commonly used?

    Knowing more about the I/O patterns will help us understand the approach we might take in designing and implementing parallel writes to compressed HDF5 datasets (and, of course, the cost!).

    Thank you!

    Elena
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    Elena Pourmal  The HDF Group http://hdfgroup.org
    1800 So. Oak St., Suite 203, Champaign IL 61820
    217.531.6112
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~



    On Jan 13, 2013, at 2:09 PM, Jerome BENOIT wrote:

    Hello,

    On 13/01/13 18:37, Robert Seigel wrote:
    Thank you for the response, Jerome. Is this not an HDF5 issue because it is not possible with HDF5? I would rather not have to compress the .h5 file after it has been created.

    Rob


    HDF5 can compress data: there is a default compressor (gzip), and you can use your own through some code; code examples can be found for bzip2. If you are a confident C coder, you can easily implement xz compression, and you can certainly implement parallel versions of those codes. (I use my own bzip2 and xz compression codes within HDF5, but I have not yet parallelized them for lack of time.)

    I think it is a bad idea to compress .h5 files after the fact: it is better to compress within. Note that you can drastically improve the compression ratio by applying suitable filters to the data.
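
    For example, a minimal sequential sketch (the chunk shape is illustrative): combining the shuffle filter with deflate often improves the ratio considerably on floating-point data.

    #include <hdf5.h>

    /* Build a dataset creation list with shuffle + deflate; shuffle
       reorders bytes so deflate finds longer runs. Filters require
       a chunked layout. */
    static hid_t make_compressed_dcpl(void)
    {
        hsize_t chunk[3] = {1, 64, 64};          /* illustrative shape */
        hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_chunk(dcpl, 3, chunk);            /* chunking required  */
        H5Pset_shuffle(dcpl);                    /* byte-shuffle first */
        H5Pset_deflate(dcpl, 6);                 /* then gzip level 6  */
        return dcpl;
    }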

    In addition to pigz and pbzip2, there is also pxz.

    Jerome


    On Sun, Jan 13, 2013 at 11:10 AM, Jerome BENOIT <[email protected]> wrote:



       On 13/01/13 16:38, Robert Seigel wrote:

           Hello,

            I am currently writing collectively to an HDF5 file in parallel using chunks, where each processor writes its subdomain as a chunk of a full dataset. I have this working correctly using hyperslabs; however, the file size is very large [about 18x larger than if the file were created with sequential HDF5 and H5Pset_deflate(plist_id, 6)]. If I try to apply that deflate call to the property list while performing parallel I/O, HDF5 says the feature is not yet supported (I am using v1.8.10). Is there any way to compress the file during a parallel write?
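
            In other words, roughly this combination (a sketch; "out.h5" and chunk_dims are illustrative, MPI is assumed to be initialized, and error checks are omitted):

            #include <mpi.h>
            #include <hdf5.h>

            /* Open a file for MPI-IO and attach deflate to the dataset
               creation list -- the combination that is rejected. */
            static hid_t open_parallel_with_deflate(const hsize_t chunk_dims[3],
                                                    hid_t *dcpl_out)
            {
                hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
                H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
                hid_t file = H5Fcreate("out.h5", H5F_ACC_TRUNC,
                                       H5P_DEFAULT, fapl);

                hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
                H5Pset_chunk(dcpl, 3, chunk_dims);
                H5Pset_deflate(dcpl, 6);   /* fine sequentially; creating a
                                              dataset with this dcpl in the
                                              MPI-IO file is what v1.8.10
                                              reports as not yet supported */
                H5Pclose(fapl);
                *dcpl_out = dcpl;
                return file;
            }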


       This is a compression issue rather than an HDF5 one: you may look for parallel versions of the usual compressors (pigz, pbzip2, ...).

       hth,
       Jerome



           Thank you,
           Rob









_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org
