On Tue, 22 Feb 2011 21:29:03 +0100, Quincey Koziol <[email protected]>
wrote:
Hi John,
On Feb 22, 2011, at 4:16 AM, Biddiscombe, John A. wrote:
Does The HDF Group have any kind of plan/schedule for enabling
compression of chunks when using parallel IO?
It's on my agenda for the first year of work that we will be starting
soon for LBNL. I think it's feasible for independent I/O, with some
work. I think collective I/O will probably require a different
approach, however. At least with collective I/O, all the processes are
available to communicate and work on things together...
The problem with the collective I/O [write] operations is that multiple
processes may be writing into each chunk, which MPI-I/O can handle when
the data is not compressed, but since compressed data is
context-sensitive, straightforward collective I/O won't work for
compressed chunks. A two-phase approach might work: ship the data for each
chunk to a single process, which updates the data in the chunk and
compresses it, followed by one or more passes of collective writes of the
compressed chunks.
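To make the first phase of that idea concrete, a rough MPI + zlib sketch of
the "ship the chunk to one owner, then compress" step might look like the
following. This is purely illustrative (the owner rank, piece size, and
file layout are assumptions; nothing like this exists in HDF5 today):

  /* Sketch of phase 1 of the hypothetical two-phase collective write:
   * every process contributes its piece of a chunk to a designated
   * "chunk owner", which assembles and compresses the chunk.
   * (Illustrative only, not part of HDF5.) */
  #include <mpi.h>
  #include <zlib.h>
  #include <stdlib.h>

  #define PIECE_ELEMS 1024            /* elements each rank contributes */

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);
      int rank, nprocs;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      const int owner = 0;            /* assume rank 0 owns this chunk */
      double piece[PIECE_ELEMS];
      for (int i = 0; i < PIECE_ELEMS; i++)
          piece[i] = rank + i * 0.001;

      /* Phase 1a: ship every rank's piece of the chunk to the owner. */
      double *chunk = NULL;
      if (rank == owner)
          chunk = malloc((size_t)nprocs * PIECE_ELEMS * sizeof(double));
      MPI_Gather(piece, PIECE_ELEMS, MPI_DOUBLE,
                 chunk, PIECE_ELEMS, MPI_DOUBLE, owner, MPI_COMM_WORLD);

      /* Phase 1b: the owner compresses the assembled chunk with zlib. */
      if (rank == owner) {
          uLong  src_len = (uLong)((size_t)nprocs * PIECE_ELEMS * sizeof(double));
          uLongf dst_len = compressBound(src_len);
          Bytef *compressed = malloc(dst_len);
          compress2(compressed, &dst_len, (const Bytef *)chunk, src_len,
                    Z_DEFAULT_COMPRESSION);
          /* Phase 2 (not shown): the ranks agree on file offsets for the
           * variable-sized compressed chunks and issue collective writes. */
          free(compressed);
          free(chunk);
      }

      MPI_Finalize();
      return 0;
  }

The hard part is the second phase: the compressed chunks now have a
different size on every owner, so the ranks still have to agree on file
offsets before the collective write, which is the space allocation problem
again.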
The problem with independent I/O [write] operations is that compressed
chunks [almost always] change size when the data in the chunk is written
(either initially, or when the data is overwritten), and since all the
processes aren't available, communicating the space allocation is a
problem. Each process needs to allocate space in the file, but since
the other processes aren't "listening", it can't let them know that some
space in the file has been used. A possible solution to this might
involve just appending data to the end of the file, but that's prone to
race conditions between processes (although maybe the "shared file
pointer" I/O mode in MPI-I/O would help this). Also, if each process
moves a chunk around in the file (because it resized it), how will other
processes learn where that chunk is, if they need to read from it?
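For what it's worth, the shared file pointer idea might look roughly like
the sketch below. It is purely illustrative (the file name and sizes are
made up), and it also shows the remaining problem: MPI_File_write_shared
serializes the appends so chunks never overlap, but no process learns where
another process's chunk landed.

  /* Sketch of independent writers appending compressed chunks to a shared
   * file via MPI's shared file pointer (illustrative only). */
  #include <mpi.h>
  #include <string.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      MPI_File fh;
      MPI_File_open(MPI_COMM_WORLD, "chunks.dat",
                    MPI_MODE_CREATE | MPI_MODE_WRONLY | MPI_MODE_APPEND,
                    MPI_INFO_NULL, &fh);

      /* Stand-in for a compressed chunk; the size varies per process. */
      char buf[256];
      int nbytes = 64 + 16 * rank;
      memset(buf, 'a' + rank, (size_t)nbytes);

      /* The shared file pointer serializes the appends, so chunks never
       * overlap. The ordering is unspecified, though, and no process is
       * told the offset at which another process's chunk was written,
       * which is exactly the metadata problem described above. */
      MPI_File_write_shared(fh, buf, nbytes, MPI_BYTE, MPI_STATUS_IGNORE);

      MPI_File_close(&fh);
      MPI_Finalize();
      return 0;
  }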
The use case being that each process compresses its own chunk at write
time and the overall file size is reduced.
(I understand that chunks are preallocated and this makes it hard to
implement compressed chunking with Parallel IO).
Some other ideas that we've been kicking around recently are:
- Using a lossy compressor (like a wavelet encoder) to put a fixed upper
limit on the size of each chunk, making them all the same size. This
will obviously affect the precision of the data stored and thus may not
be a good solution for restart dumps, although it might be fine for
visualization/plot files. It's great from the perspective that it
completely eliminates the space allocation problem, though.
- Use a lossless compressor (like gzip), but put an upper limit on the
compressed size of a chunk, something that's likely to be achievable,
like 2:1 or so. Then, if a chunk can't be compressed to that size,
have the I/O operation fail. This eliminates the space allocation
issue, but at the cost of possibly not being able to write compressed
data at all.
- Alternatively, use a lossless compressor with an upper limit on the
compressed size of a chunk, but also allow for chunks that aren't able
to be compressed to the goal ratio to be stored uncompressed. So, the
dataset will only have two sizes of chunks: full-size chunks and
half-size (or third-size, etc.) chunks, which limits the space allocation
complexities involved. I'm not certain this buys much in the way of
benefits, since it doesn't eliminate space allocation, and probably
wouldn't address the space allocation problems with independent I/O.
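To make that last option concrete, the per-chunk decision might look
roughly like this zlib sketch (the 2:1 budget and the chunk size are
illustrative, and nothing like this exists in the library today):

  /* Sketch of "compress to a fixed budget, or store uncompressed":
   * try zlib on a chunk; if the result doesn't fit in half the chunk
   * size, fall back to the uncompressed chunk (illustrative only). */
  #include <zlib.h>
  #include <string.h>

  #define CHUNK_BYTES 65536

  /* Returns the number of bytes placed in 'out' (CHUNK_BYTES available)
   * and sets *compressed to 1 if the 2:1 budget was met, 0 otherwise. */
  size_t pack_chunk(const unsigned char *chunk, unsigned char *out,
                    int *compressed)
  {
      uLongf out_len = CHUNK_BYTES;
      int rc = compress2(out, &out_len, chunk, CHUNK_BYTES,
                         Z_DEFAULT_COMPRESSION);

      if (rc == Z_OK && out_len <= CHUNK_BYTES / 2) {
          *compressed = 1;    /* would occupy a half-size slot in the file */
          return (size_t)out_len;
      }

      /* Budget missed (or data grew): store the chunk uncompressed. */
      memcpy(out, chunk, CHUNK_BYTES);
      *compressed = 0;
      return CHUNK_BYTES;
  }

Either way, the allocator only ever hands out full-size or half-size
blocks, which is the point of the idea.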
Any other ideas or input?
Maybe HDF5 could allocate some space for the uncompressed data, and if the
compressed data doesn't use all that space, re-use the leftover space for
other purposes within the same process, similar to a sparse matrix. This
would not reduce the file size when writing the first dataset, but
subsequent writes could benefit from it, as would an h5copy of the final
dataset later (if copying is an option).
Werner
Quincey
Thanks
JB
-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Frank Baker
Sent: 04 October 2010 22:31
To: HDF Users Discussion List
Subject: [Hdf-forum] New "Chunking in HDF5" document
Several users have raised questions regarding chunking in HDF5. Partly
in response to these questions, the initial draft of a new "Chunking in
HDF5" document is now available on The HDF Group's website:
http://www.hdfgroup.org/HDF5/doc/_topic/Chunking/
This draft includes sections on the following topics:
General description of chunks
Storage and access order
Partial I/O
Chunk caching
I/O filters and compression
Pitfalls and errors to avoid
Additional Resources
Future directions
Several suggestions for tuning chunking in an application are provided
along the way.
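For readers who just want to see the API that the filters/compression
section covers, a minimal serial example of creating a chunked,
gzip-compressed dataset looks like this (the file and dataset names are
made up for illustration):

  /* Minimal serial example: a chunked dataset with gzip (deflate)
   * compression. */
  #include "hdf5.h"

  int main(void)
  {
      hsize_t dims[2]  = {1024, 1024};   /* dataset extent   */
      hsize_t chunk[2] = {64, 64};       /* chunk dimensions */

      hid_t file  = H5Fcreate("chunked.h5", H5F_ACC_TRUNC,
                              H5P_DEFAULT, H5P_DEFAULT);
      hid_t space = H5Screate_simple(2, dims, NULL);

      /* Chunking is required before any filter can be applied. */
      hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
      H5Pset_chunk(dcpl, 2, chunk);
      H5Pset_deflate(dcpl, 6);           /* gzip level 6 */

      hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_FLOAT, space,
                              H5P_DEFAULT, dcpl, H5P_DEFAULT);

      H5Dclose(dset);
      H5Pclose(dcpl);
      H5Sclose(space);
      H5Fclose(file);
      return 0;
  }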
As a draft, this remains a work in progress; your feedback will be
appreciated and will be very useful in the document's development. For
example, let us know if there are additional questions that you would
like to see treated.
Regards,
-- Frank Baker
HDF Documentation
[email protected]
--
___________________________________________________________________________
Dr. Werner Benger Visualization Research
Laboratory for Creative Arts and Technology (LCAT)
Center for Computation & Technology at Louisiana State University (CCT/LSU)
211 Johnston Hall, Baton Rouge, Louisiana 70803
Tel.: +1 225 578 4809 Fax.: +1 225 578-5362
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org