Hi John,

On Feb 22, 2011, at 4:16 AM, Biddiscombe, John A. wrote:

> Does the hdfgroup have any kind of plan/schedule for enabling compression of 
> chunks when using parallel IO?

        It's on my agenda for the first year of work that we will be starting 
soon for LBNL.  I think it's feasible for independent I/O, with some work; 
collective I/O will probably require a different approach, however.  At least 
with collective I/O, all the processes are available to communicate and work 
on things together...

        The problem with collective I/O [write] operations is that multiple 
processes may be writing into each chunk.  MPI-I/O can handle that when the 
data is not compressed, but since compressed data is context-sensitive, 
straightforward collective I/O won't work for compressed chunks.  Perhaps a 
two-phase approach would work: ship the data for each chunk to a single 
process, which updates the chunk and compresses it, followed by one or more 
passes of collective writes of the compressed chunks.
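
        Here's a rough sketch of the two-phase idea, using MPI + zlib, to make 
it concrete.  Everything here is an assumption of mine, not anything in HDF5: 
one "owner" rank per chunk, each rank holding a contiguous piece of that 
chunk, and all the chunk-index/metadata updates omitted:

/* Rough sketch, not HDF5 internals: two-phase collective write of one
 * compressed chunk.  Assumes a designated "owner" rank per chunk and
 * that each rank holds a contiguous piece of it. */
#include <mpi.h>
#include <zlib.h>
#include <stdlib.h>

int write_chunk_two_phase(MPI_Comm comm, MPI_File fh, int owner,
                          const char *my_piece, int my_count,
                          int chunk_size, MPI_Offset chunk_offset)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    /* Phase 1: ship every rank's piece of the chunk to its owner. */
    int *counts = NULL, *displs = NULL;
    char *chunk = NULL;
    if (rank == owner) {
        counts = malloc(nprocs * sizeof(int));
        displs = malloc(nprocs * sizeof(int));
        chunk  = malloc(chunk_size);
    }
    MPI_Gather(&my_count, 1, MPI_INT, counts, 1, MPI_INT, owner, comm);
    if (rank == owner)
        for (int i = 0, off = 0; i < nprocs; off += counts[i++])
            displs[i] = off;
    MPI_Gatherv(my_piece, my_count, MPI_BYTE,
                chunk, counts, displs, MPI_BYTE, owner, comm);

    /* Phase 2: the owner compresses the assembled chunk, then everyone
     * makes a collective write call (non-owners contribute 0 bytes).
     * A real version would read+decompress the old chunk first when
     * only part of it is being overwritten. */
    uLongf zlen = 0;
    Bytef *zbuf = NULL;
    if (rank == owner) {
        zlen = compressBound(chunk_size);
        zbuf = malloc(zlen);
        compress(zbuf, &zlen, (const Bytef *)chunk, chunk_size);
    }
    MPI_File_write_at_all(fh, chunk_offset, zbuf, (int)zlen, MPI_BYTE,
                          MPI_STATUS_IGNORE);
    free(zbuf); free(chunk); free(counts); free(displs);
    return 0;
}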

        The problem with independent I/O [write] operations is that compressed 
chunks [almost always] change size when the data in the chunk is written 
(either initially, or when the data is overwritten), and since the other 
processes aren't participating in the operation, communicating the space 
allocation is a problem.  Each process needs to allocate space in the file, 
but since the other processes aren't "listening", it can't let them know that 
some space in the file has been used.  A possible solution might be to just 
append data to the end of the file, but that's prone to race conditions 
between processes (although maybe the "shared file pointer" I/O mode in 
MPI-I/O would help here).  Also, if a process moves a chunk around in the file 
(because it resized it), how will the other processes learn where that chunk 
is when they need to read from it?
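
        For what it's worth, here's a sketch of the append-to-the-end idea 
with the shared file pointer, again with zlib and with names of my own 
invention.  The shared pointer serializes concurrent appends, but note that 
MPI_File_write_shared doesn't report where the data landed, so the 
chunk-location problem above remains open:

/* Sketch, not HDF5 code: independent-mode append of one compressed
 * chunk using MPI's shared file pointer.  Appends from different ranks
 * are serialized and won't overwrite each other, but the resulting file
 * offset isn't reported back -- updating the chunk index so other ranks
 * can later find the chunk is still unsolved. */
#include <mpi.h>
#include <zlib.h>
#include <stdlib.h>

int append_chunk_shared(MPI_File fh, const char *chunk, uLong chunk_size)
{
    uLongf zlen = compressBound(chunk_size);
    Bytef *zbuf = malloc(zlen);
    if (compress(zbuf, &zlen, (const Bytef *)chunk, chunk_size) != Z_OK) {
        free(zbuf);
        return -1;
    }
    /* Each call atomically advances the shared file pointer. */
    int rc = MPI_File_write_shared(fh, zbuf, (int)zlen, MPI_BYTE,
                                   MPI_STATUS_IGNORE);
    free(zbuf);
    return (rc == MPI_SUCCESS) ? 0 : -1;
}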

> The use case being that each process compresses its own chunk at write time 
> and the overall file size is reduced. 
> (I understand that chunks are preallocated and this makes it hard to 
> implement compressed chunking with Parallel IO).

        Some other ideas that we've been kicking around recently are:

- Using a lossy compressor (like a wavelet encoder) to put a fixed upper limit 
on the size of each chunk, making them all the same size (a crude stand-in 
sketch appears after this list).  This will obviously affect the precision of 
the stored data and thus may not be a good solution for restart dumps, 
although it might be fine for visualization/plot files.  It's great from the 
perspective that it completely eliminates the space allocation problem, though.

- Use a lossless compressor (like gzip), but put an upper limit on the 
compressed size of a chunk, something that's likely to be achievable, like 2:1 
or so.  Then, if a chunk can't be compressed to that size, have the I/O 
operation fail (see the sketch after this list).  This eliminates the space 
allocation issue, but at the cost of possibly not being able to write 
compressed data at all.

- Alternatively, use a lossless compressor with an upper limit on the 
compressed size of a chunk, but allow chunks that can't be compressed to the 
goal ratio to be stored uncompressed (the sketch after this list includes that 
fallback).  Then the dataset will only have two sizes of chunks, full-size and 
half-size (or third-size, etc.), which limits the space allocation 
complexities involved.  I'm not certain this buys much in the way of benefits, 
since it doesn't eliminate space allocation, and probably wouldn't address the 
space allocation problems with independent I/O.
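
        A real wavelet coder is more than I can sketch here, but even the 
crudest fixed-rate stand-in shows the property that matters for the first 
idea: every chunk comes out exactly the same size, so space allocation 
becomes trivial.  (This just truncates doubles to floats for a guaranteed 
2:1; the names are mine, not HDF5's.)

/* Stand-in for a fixed-rate lossy compressor: truncate each double to a
 * float, giving exactly 2:1 on every chunk regardless of content.  A
 * wavelet encoder would preserve far more precision, but the space
 * allocation property is the same: all compressed chunks are equal size. */
#include <stddef.h>

void lossy_pack_2to1(const double *in, float *out, size_t nelems)
{
    for (size_t i = 0; i < nelems; i++)
        out[i] = (float)in[i];      /* precision loss happens here */
}

void lossy_unpack_2to1(const float *in, double *out, size_t nelems)
{
    for (size_t i = 0; i < nelems; i++)
        out[i] = (double)in[i];
}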

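        For the two lossless variants, the compress-into-a-budget step might 
look like this with zlib (only the zlib calls are real; the rest is my own 
sketch).  compress2() fails with Z_BUF_ERROR when the output won't fit the 
budget, which maps directly onto "fail the I/O" for the second idea, or 
"store the chunk raw, plus a flag in the chunk index" for the third:

/* Try to compress a chunk into a fixed budget (e.g. chunk_size/2 for a
 * 2:1 goal).  Returns 1 if the chunk fit (write *zlen compressed bytes),
 * 0 if it didn't (fail the I/O, or fall back to storing the chunk
 * uncompressed, per the third idea above), -1 on a real zlib error. */
#include <zlib.h>

int compress_into_budget(const unsigned char *chunk, uLong chunk_size,
                         unsigned char *zbuf, uLong budget, uLongf *zlen)
{
    *zlen = budget;
    int rc = compress2(zbuf, zlen, chunk, chunk_size,
                       Z_DEFAULT_COMPRESSION);
    if (rc == Z_OK)
        return 1;   /* fits: store zbuf, mark chunk "compressed"    */
    if (rc == Z_BUF_ERROR)
        return 0;   /* over budget: fail, or store raw + set a flag */
    return -1;
}
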
        Any other ideas or input?

                Quincey

> Thanks
> 
> JB
> 
> 
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] 
> On Behalf Of Frank Baker
> Sent: 04 October 2010 22:31
> To: HDF Users Discussion List
> Subject: [Hdf-forum] New "Chunking in HDF5" document
> 
> 
> Several users have raised questions regarding chunking in HDF5.  Partly in 
> response to these questions, the initial draft of a new "Chunking in HDF5" 
> document is now available on The HDF Group's website:
>    http://www.hdfgroup.org/HDF5/doc/_topic/Chunking/
> 
> This draft includes sections on the following topics:
>    General description of chunks
>    Storage and access order
>    Partial I/O
>    Chunk caching
>    I/O filters and compression
>    Pitfalls and errors to avoid
>    Additional Resources
>    Future directions
> Several suggestions for tuning chunking in an application are provided along 
> the way.
> 
> As a draft, this remains a work in progress; your feedback will be 
> appreciated and will be very useful in the document's development.  For 
> example, let us know if there are additional questions that you would like to 
> see treated.
> 
> Regards,
> -- Frank Baker
>   HDF Documentation
>   [email protected]


_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org
