On Tue, 22 Feb 2011 21:29:03 +0100, Quincey Koziol <[email protected]> wrote:

Hi John,

On Feb 22, 2011, at 4:16 AM, Biddiscombe, John A. wrote:

Does the HDF Group have any kind of plan/schedule for enabling compression of chunks when using parallel I/O?

It's on my agenda for the first year of work that we will be starting soon for LBNL. I think it's feasible for independent I/O, with some work. I think collective I/O will probably require a different approach, however. At least with collective I/O, all the processes are available to communicate and work on things together...

The problem with the collective I/O [write] operations is that multiple processes may be writing into each chunk, which MPI-I/O can handle when the data is not compressed; but since compressed data is context-sensitive, straightforward collective I/O won't work for compressed chunks. Perhaps a two-phase approach would work, where the data for each chunk is shipped to a single process that updates and compresses it, followed by one or more passes of collective writes of the compressed chunks.
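
A rough sketch of that two-phase idea, with rank 0 as the chunk's owner and zlib standing in for whatever filter is applied (the merge of overlapping pieces and the final collective write are elided):

    /* Phase 1: ship each rank's piece of the chunk to the owner, which
     * merges and compresses the whole chunk once.  Phase 2 (elided)
     * would exchange compressed sizes, compute offsets, and issue one
     * or more collective writes. */
    #include <mpi.h>
    #include <zlib.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define PIECE 4096                     /* bytes contributed per rank */

    int main(int argc, char **argv)
    {
        int rank, nranks;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        unsigned char piece[PIECE];
        memset(piece, rank & 0xff, PIECE); /* stand-in for real data */

        unsigned char *chunk = NULL;
        if (rank == 0)
            chunk = malloc((size_t)nranks * PIECE);

        /* Phase 1: everyone's piece lands on the owner (rank 0). */
        MPI_Gather(piece, PIECE, MPI_BYTE,
                   chunk, PIECE, MPI_BYTE, 0, MPI_COMM_WORLD);

        if (rank == 0) {
            /* The owner compresses the merged chunk in one shot, so the
             * context-sensitivity problem above disappears. */
            uLong  rawlen = (uLong)nranks * PIECE;
            uLongf clen   = compressBound(rawlen);
            unsigned char *cbuf = malloc(clen);
            compress2(cbuf, &clen, chunk, rawlen, Z_DEFAULT_COMPRESSION);
            printf("chunk: %lu -> %lu bytes\n",
                   (unsigned long)rawlen, (unsigned long)clen);
            free(cbuf);
            free(chunk);
        }
        MPI_Finalize();
        return 0;
    }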

The problem with independent I/O [write] operations is that compressed chunks [almost always] change size when the data in the chunk is written (either initially, or when the data is overwritten), and since not all the processes are available, communicating the space allocation is a problem. Each process needs to allocate space in the file, but since the other processes aren't "listening", it can't let them know that some space in the file has been used. A possible solution might be to just append data to the end of the file, but that's prone to race conditions between processes (although maybe the "shared file pointer" I/O mode in MPI-I/O would help with this). Also, if each process moves a chunk around in the file (because it resized it), how will other processes learn where that chunk is, if they need to read from it?
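
For the record, the shared-file-pointer append would look something like this; note that it serializes the appends but still doesn't tell anyone where a chunk landed:

    /* Independent appends through MPI-I/O's shared file pointer: the
     * pointer advances atomically, so concurrent appends don't clobber
     * each other.  But the open question above remains: a rank never
     * learns the offset its chunk landed at without extra
     * coordination, so other processes can't find the chunk later. */
    #include <mpi.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_File fh;
        MPI_Status st;
        char cbuf[1024];                 /* pretend: a compressed chunk */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        memset(cbuf, 'a' + (rank % 26), sizeof(cbuf));

        MPI_File_open(MPI_COMM_WORLD, "chunks.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);

        /* Each call appends at the shared pointer, in some serialized
         * (but nondeterministic) order across ranks. */
        MPI_File_write_shared(fh, cbuf, sizeof(cbuf), MPI_CHAR, &st);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }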

The use case is that each process compresses its own chunk at write time, reducing the overall file size. (I understand that chunks are preallocated, and this makes it hard to implement compressed chunking with parallel I/O.)
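
Concretely, what I'm after is the usual chunked + deflate dataset creation, just through the MPI-IO driver; as I understand it, the library currently refuses this combination once a filter is set. A minimal sketch:

    /* The desired pattern: a chunked dataset with the deflate filter,
     * created through the MPI-IO driver.  Today the create (or the
     * later write) fails once a filter is set on a dataset in a file
     * opened for parallel access -- this is the gap. */
    #include <hdf5.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
        hid_t file = H5Fcreate("parallel.h5", H5F_ACC_TRUNC,
                               H5P_DEFAULT, fapl);

        hsize_t dims[1]  = {1024 * 1024};
        hsize_t cdims[1] = {16384};
        hid_t space = H5Screate_simple(1, dims, NULL);

        hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_chunk(dcpl, 1, cdims);
        H5Pset_deflate(dcpl, 6);   /* <- the part parallel I/O rejects */

        hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_INT, space,
                                H5P_DEFAULT, dcpl, H5P_DEFAULT);

        if (dset >= 0) H5Dclose(dset);
        H5Pclose(dcpl);
        H5Sclose(space);
        H5Fclose(file);
        H5Pclose(fapl);
        MPI_Finalize();
        return 0;
    }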

        Some other ideas that we've been kicking around recently are:

- Using a lossy compressor (like a wavelet encoder) to put a fixed upper limit on the size of each chunk, making them all the same size. This will obviously affect the precision of the data stored and thus may not be a good solution for restart dumps, although it might be fine for visualization/plot files. It's great from the perspective that it completely eliminates the space allocation problem, though.

- Use a lossless compressor (like gzip), but put an upper limit on the compressed size of a chunk, something that's likely to be achievable, like 2:1 or so. Then, if a chunk can't be compressed to that size, have the I/O operation fail (this check is sketched just after this list). This eliminates the space allocation issue, but at the cost of possibly not being able to write compressed data at all.

- Alternatively, use a lossless compressor with an upper limit on the compressed size of a chunk, but also allow chunks that can't be compressed to the goal ratio to be stored uncompressed. The dataset will then have only two sizes of chunks, full-size and half-size (or third-size, etc.), which limits the space allocation complexities involved. I'm not certain this buys much, though, since it doesn't eliminate space allocation, and it probably wouldn't address the space allocation problems with independent I/O.
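
A minimal sketch of these size-capped variants, using zlib and a 2:1 target (pack_chunk and its constants are illustrative, not library API):

    /* Try to squeeze a chunk into at most half its raw size; on
     * failure either reject the write (second idea) or store the chunk
     * uncompressed behind a flag (third idea).  With the lossy
     * fixed-size scheme (first idea), every chunk is exactly CAP
     * bytes, so chunk i lives at base + i * CAP and the allocation
     * problem disappears entirely. */
    #include <zlib.h>
    #include <string.h>
    #include <stddef.h>

    #define RAW_SIZE 65536
    #define CAP      (RAW_SIZE / 2)        /* the 2:1 target ratio */

    /* 'out' must hold RAW_SIZE bytes.  Returns the stored size and sets
     * *compressed, or returns 0 to signal "fail the I/O operation". */
    size_t pack_chunk(const unsigned char *raw, unsigned char *out,
                      int allow_fallback, int *compressed)
    {
        uLongf clen = CAP;                 /* cap the output buffer... */
        if (compress2(out, &clen, raw, RAW_SIZE,
                      Z_BEST_COMPRESSION) == Z_OK) {
            *compressed = 1;               /* ...so Z_OK means "it fit" */
            return CAP;                    /* store in a half-size slot */
        }
        if (!allow_fallback)
            return 0;                      /* idea 2: the write fails */
        memcpy(out, raw, RAW_SIZE);        /* idea 3: store uncompressed */
        *compressed = 0;
        return RAW_SIZE;
    }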

        Any other ideas or input?

Maybe HDF5 could allocate space for the uncompressed data, and if the compressed data doesn't use all that space, reuse the leftover space for other purposes within the same process, similar to a sparse matrix. This would not reduce the file size when writing the first dataset, but subsequent writes could benefit from it, as would an h5copy of the final dataset later (if copying is an option).
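
Something along these lines, where each process keeps a small first-fit list of the leftover tails (names are illustrative, not HDF5 internals):

    /* Per-process bookkeeping: after a chunk compresses to 'used'
     * bytes inside a preallocated extent, remember the unused tail and
     * hand it out first-fit to later writes from the same process. */
    #include <stddef.h>

    #define MAX_HOLES 1024

    typedef struct { size_t offset, size; } hole_t;

    static hole_t holes[MAX_HOLES];
    static int    nholes;

    void record_leftover(size_t extent_off, size_t extent, size_t used)
    {
        if (used < extent && nholes < MAX_HOLES) {
            holes[nholes].offset = extent_off + used;
            holes[nholes].size   = extent - used;
            nholes++;
        }
    }

    /* Returns a file offset for 'want' bytes, or (size_t)-1 if no
     * leftover tail is big enough (then allocate fresh space instead). */
    size_t reuse_leftover(size_t want)
    {
        for (int i = 0; i < nholes; i++) {
            if (holes[i].size >= want) {
                size_t off = holes[i].offset;
                holes[i].offset += want;
                holes[i].size   -= want;
                return off;
            }
        }
        return (size_t)-1;
    }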

         Werner




                Quincey

Thanks

JB


-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Frank Baker
Sent: 04 October 2010 22:31
To: HDF Users Discussion List
Subject: [Hdf-forum] New "Chunking in HDF5" document


Several users have raised questions regarding chunking in HDF5. Partly in response to these questions, the initial draft of a new "Chunking in HDF5" document is now available on The HDF Group's website:
   http://www.hdfgroup.org/HDF5/doc/_topic/Chunking/

This draft includes sections on the following topics:
   General description of chunks
   Storage and access order
   Partial I/O
   Chunk caching
   I/O filters and compression
   Pitfalls and errors to avoid
   Additional resources
   Future directions
Several suggestions for tuning chunking in an application are provided along the way.
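
As a quick taste of the machinery the document covers, a minimal chunked, compressed dataset boils down to two dataset creation properties:

    /* A 1-D dataset stored in 16384-element chunks with the deflate
     * (gzip) filter at level 6; each chunk is compressed as it is
     * written. */
    #include <hdf5.h>

    int main(void)
    {
        hsize_t dims[1]  = {1024 * 1024};
        hsize_t cdims[1] = {16384};

        hid_t file  = H5Fcreate("chunked.h5", H5F_ACC_TRUNC,
                                H5P_DEFAULT, H5P_DEFAULT);
        hid_t space = H5Screate_simple(1, dims, NULL);

        hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_chunk(dcpl, 1, cdims);      /* storage is chunk by chunk */
        H5Pset_deflate(dcpl, 6);           /* gzip each chunk on write */

        hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_INT, space,
                                H5P_DEFAULT, dcpl, H5P_DEFAULT);

        H5Dclose(dset);
        H5Pclose(dcpl);
        H5Sclose(space);
        H5Fclose(file);
        return 0;
    }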

As a draft, this remains a work in progress; your feedback will be appreciated and very useful in the document's development. For example, let us know if there are additional questions that you would like to see treated.

Regards,
-- Frank Baker
  HDF Documentation
  [email protected]






--
___________________________________________________________________________
Dr. Werner Benger                Visualization Research
Laboratory for Creative Arts and Technology (LCAT)
Center for Computation & Technology at Louisiana State University (CCT/LSU)
211 Johnston Hall, Baton Rouge, Louisiana 70803
Tel.: +1 225 578 4809                        Fax.: +1 225 578 5362

_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org
