b) promote these blocks from datasets to chunks, so that the HDF library
would be responsible for the virtual addressing and would do all the real
work at retrieval time.
It seems like HDF already does everything we want if we had b) in place. Once
the chunks are on disk and indexed correctly, a user selecting a slab will
trigger retrieval of the relevant chunks and, as long as the decompression
filter is available, the library will handle that too. There'd be no need for
a virtual dataset to map access to the sub-datasets underneath.
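(Just to illustrate the read side: this is the standard hyperslab path, with
made-up file and dataset names; the library does the chunk lookup itself and
runs the registered deflate filter on each chunk it touches.)

#include <hdf5.h>
#include <stdlib.h>

int main(void)
{
    /* Open an existing chunked, deflate-compressed dataset.
     * "data.h5" and "/data" are placeholder names for this sketch. */
    hid_t file = H5Fopen("data.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t dset = H5Dopen2(file, "/data", H5P_DEFAULT);

    /* Select a 100x100 slab starting at (200, 300) in the file. */
    hsize_t start[2] = {200, 300};
    hsize_t count[2] = {100, 100};
    hid_t filespace = H5Dget_space(dset);
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);

    hid_t memspace = H5Screate_simple(2, count, NULL);
    double *buf = malloc(count[0] * count[1] * sizeof(double));

    /* The library works out which chunks intersect the slab, reads them,
     * and decompresses each one through the filter pipeline. */
    H5Dread(dset, H5T_NATIVE_DOUBLE, memspace, filespace, H5P_DEFAULT, buf);

    free(buf);
    H5Sclose(memspace);
    H5Sclose(filespace);
    H5Dclose(dset);
    H5Fclose(file);
    return 0;
}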
Hmm, so you'd have some new "bind" operation that took as input a bunch
of datasets and bound them together as a new dataset?
Essentially yes, I had something quite intrusive in mind. What I was thinking
was that each process independently creates a dataset and compresses it (it
could be just a memory buffer rather than an HDF5 dataset). Collectively, a new
dataset is created with the correct extents for the whole data. Chunks are
'requested' by each process, and instead of allowing HDF to manage and allocate
the chunks, we intercept the chunk generation/allocation (override it) and
simply supply our own, using our compressed data buffer. HDF then does all the
bookkeeping as usual and writes/flushes the data to disk. Provided the chunk
extents are regular, the compressed data could vary in final size from chunk to
chunk (some tidying up might be necessary in the chunk code).
On load, the user can treat the data as a completely normal dataset, but
compressed.
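To make the write side concrete (this is only a sketch of the handoff, not a
claim about where the intercept would live in the library), and assuming
something like the direct chunk write call that HDF5 now exposes
(H5Dwrite_chunk, or H5DOwrite_chunk in the high-level library), the per-chunk
step might look roughly like this, with zlib doing the compression up front
and placeholder names and sizes throughout:

#include <hdf5.h>
#include <zlib.h>
#include <stdlib.h>

#define NX 128   /* dataset extent; one chunk per process in the real scheme */
#define NY 128
#define CX 64    /* chunk extent */
#define CY 64

int main(void)
{
    /* Create a chunked dataset with the deflate filter registered, so a
     * plain H5Dread on load decompresses transparently. */
    hid_t file = H5Fcreate("packed.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hsize_t dims[2]  = {NX, NY};
    hsize_t cdims[2] = {CX, CY};
    hid_t space = H5Screate_simple(2, dims, NULL);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 2, cdims);
    H5Pset_deflate(dcpl, 6);
    hid_t dset = H5Dcreate2(file, "/data", H5T_NATIVE_DOUBLE, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    /* Each process would compress its own block up front; here we just
     * zlib-compress one chunk's worth of doubles in memory. */
    double *raw = malloc(CX * CY * sizeof(double));
    for (size_t i = 0; i < CX * CY; i++) raw[i] = (double)i;
    uLongf clen = compressBound(CX * CY * sizeof(double));
    Bytef *cbuf = malloc(clen);
    compress2(cbuf, &clen, (const Bytef *)raw, CX * CY * sizeof(double), 6);

    /* Hand the pre-compressed buffer straight to the library for the chunk
     * at offset (0,0); it does the allocation, indexing and bookkeeping, and
     * a different compressed size for each chunk is fine. */
    hsize_t offset[2] = {0, 0};
    H5Dwrite_chunk(dset, H5P_DEFAULT, 0 /* no filters skipped */, offset,
                   (size_t)clen, cbuf);

    free(cbuf); free(raw);
    H5Dclose(dset); H5Pclose(dcpl); H5Sclose(space); H5Fclose(file);
    return 0;
}

The collective part, each process supplying its own pre-compressed chunk into
the shared dataset, is of course where the intercept comes in; the sketch only
shows the single-process, single-chunk handoff.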
I suspect this is what you meant too, but I thought I'd spell it out more
clearly just in case. I will start poking around with the chunking code to see
if I can intercept things at convenient places. Please stop me if you think I'm
pursuing a bad idea.
JB