On Tue, Dec 14, 2010 at 5:42 PM, Quincey Koziol <[email protected]> wrote:

> Hi Leigh,
>
>
 [snipped for brevity]

Quincey,
>
> Probably a combination of both; ideally, a group of MPI ranks would
> collectively write one compressed HDF5 file. On Blue Waters, a 100k-core run
> with 32 cores/MCM could therefore result in, say, around 3,000 files, which
> is not unreasonable.
>
> Maybe I'm thinking about this too simply, but couldn't you compress the
> data on each MPI rank, save it in a buffer, calculate the space required,
> and then write it? I don't know enough about the internal workings of HDF5
> to know whether that would fit the HDF5 model. In our particular
> application on Blue Waters, memory is cheap, so there is lots of space in
> memory for buffering data.
>
>
> What you say above is basically what happens, except that space in the file
> needs to be allocated for each block of compressed data.  Since each block
> is not the same size, the HDF5 library can't pre-allocate the space or
> algorithmically determine how much to reserve for each process.  In the case
> of collective I/O, at least it's theoretically possible for all the
> processes to communicate and work it out, but I'm not certain it's going to
> be solvable for independent I/O, unless we reserve some processes to either
> allocate space (like a "free space server") or buffer the "I/O", etc.
>

Could you make this work by forcing each core to use a specific chunking
arrangement? For instance, each core's subdomain dimensions could simply be
the chunk dimensions, which actually works out pretty well in my application,
at least in the horizontal. I typically have nxchunk=nx and nychunk=ny, with
nzchunk set to something like 20. But now that I think about it, even in that
case you don't know the size of the compressed chunks until you've compressed
them, so you'd still need to communicate the compressed chunk sizes among the
cores writing to an individual file.
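
To make that concrete, here's a minimal serial sketch of the chunk layout I
have in mind, with the deflate filter applied per chunk. The function name,
dataset name ("/th"), and dimension values are just placeholders from my
setup, and it uses the serial HDF5 1.8 API, since the deflate filter can't
currently be combined with parallel writes:

/* Sketch only: chunk = one core's horizontal subdomain, ~20 levels deep.
 * Names and dimensions are placeholders; serial HDF5 API. */
#include "hdf5.h"

int write_core_subdomain(const char *fname, const float *data,
                         hsize_t nx, hsize_t ny, hsize_t nz)
{
    hsize_t dims[3]  = { nz, ny, nx };                /* whole per-core array */
    hsize_t chunk[3] = { nz < 20 ? nz : 20, ny, nx }; /* nzchunk ~ 20 levels  */

    hid_t file  = H5Fcreate(fname, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(3, dims, NULL);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);

    H5Pset_chunk(dcpl, 3, chunk);   /* chunk matches the core's subdomain */
    H5Pset_deflate(dcpl, 6);        /* gzip each chunk, compression level 6 */

    hid_t dset = H5Dcreate2(file, "/th", H5T_NATIVE_FLOAT, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    H5Dclose(dset);
    H5Pclose(dcpl);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}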

I don't know enough about HDF5 to understand how the preallocation process
works. It sounds like you allocate a bunch of zeroes (or something) on disk
first and then do I/O straight to that space on disk? If that's the case,
then I can see why splitting compression among MPI ranks would require some
kind of collective communication.
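
Just to illustrate the sort of bookkeeping I'm imagining, here is a sketch in
plain MPI-IO; this is not how HDF5 allocates space internally, and the
function and buffer names are made up. Each rank compresses its block, the
ranks exchange compressed sizes via a prefix sum, and each one then writes at
its own offset in the shared file:

/* Bookkeeping sketch only: zbuf holds this rank's already-compressed bytes,
 * zbytes is its length.  Plain MPI-IO, not HDF5's internal allocation. */
#include <mpi.h>

void write_compressed_block(MPI_Comm comm, MPI_File fh,
                            void *zbuf, long long zbytes)
{
    int       rank;
    long long offset = 0;

    MPI_Comm_rank(comm, &rank);

    /* Exclusive prefix sum of compressed sizes -> each rank's file offset. */
    MPI_Exscan(&zbytes, &offset, 1, MPI_LONG_LONG, MPI_SUM, comm);
    if (rank == 0)
        offset = 0;            /* MPI_Exscan leaves rank 0's result undefined */

    /* Collective write: each rank drops its compressed bytes at its offset. */
    MPI_File_write_at_all(fh, (MPI_Offset)offset, zbuf, (int)zbytes,
                          MPI_BYTE, MPI_STATUS_IGNORE);
}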

Personally, I am perfectly happy with a bit of overhead that forces all cores
to share the compressed block sizes among themselves before writing, if it
means we can do compression. Right now I see my choices as being (1)
compressed output, but one file per MPI rank and therefore lots of files, or
(2) no compression and fewer files, perhaps compressing later with h5repack
run in parallel as a post-processing step, one h5repack per MPI rank (yuck!).

I'm glad you're working on this; personally, I think this is important stuff
for really huge simulations. In talking to other folks who will be using Blue
Waters, compression is not much of an issue for many of them because of the
nature of their data, but cloud data especially tends to compress very well.
It would be a shame to fill terabytes of disk space with zeroes! I am sure we
can still carry out our research objectives without compression, but the
sheer amount of data we will be producing is staggering even with
compression.

Leigh


> Quincey
>
>


-- 
Leigh Orf
Associate Professor of Atmospheric Science
Department of Geology and Meteorology
Central Michigan University
Currently on sabbatical at the National Center for Atmospheric Research
in Boulder, CO
NCAR office phone: (303) 497-8200
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org
