b)      promote these blocks from datasets to chunks, so that the HDF library 
would be responsible for the virtual addressing and would do all the real work 
at retrieval time.

It seems like HDF already does everything we want once b) is in place. Once 
the chunks are on disk and indexed correctly, a user selecting a slab will 
trigger retrieval of the relevant chunks and, as long as the decompression 
filter is available, the library will handle the decompression too. There'd 
be no need for a virtual dataset to map access onto the sub-datasets 
underneath.
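
To make the read side concrete, this is roughly what it would look like from 
the user's point of view, using only the standard HDF5 C calls for hyperslab 
selection; the file name "whole.h5", dataset name "data" and the extents are 
placeholders I've made up for illustration:

/* Sketch: read a hyperslab from the assembled dataset exactly as if it were
 * an ordinary compressed, chunked dataset.  The library fetches the chunks
 * covering the slab and runs the deflate filter transparently, provided the
 * filter is available at read time. */
#include <hdf5.h>

int read_slab(float *out /* 256 x 256 buffer */)
{
    hid_t file = H5Fopen("whole.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t dset = H5Dopen2(file, "data", H5P_DEFAULT);

    hsize_t start[2] = {1000, 2000};   /* slab origin in the full dataset */
    hsize_t count[2] = {256, 256};     /* slab extents */

    hid_t fspace = H5Dget_space(dset);
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);

    hid_t mspace = H5Screate_simple(2, count, NULL);

    herr_t status = H5Dread(dset, H5T_NATIVE_FLOAT, mspace, fspace,
                            H5P_DEFAULT, out);

    H5Sclose(mspace);
    H5Sclose(fspace);
    H5Dclose(dset);
    H5Fclose(file);
    return (int)status;
}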

      Hmm, so you'd have some new "bind" operation that took as input a bunch 
of datasets and bound them together as a new dataset?

Essentially yes; I had something quite intrusive in mind. What I was thinking 
was that each process independently creates a dataset and compresses it (it 
could be just a memory buffer rather than an HDF5 dataset). Collectively, a 
new dataset is created with the correct extents for the whole data. Chunks 
are 'requested' by each process, but instead of letting HDF manage the chunks 
and allocate them, we intercept (override) the chunk generation/allocation 
and simply supply our own, using our compressed data buffers. HDF then does 
all the bookkeeping as usual and writes/flushes the data to disk. Provided 
the chunk extents are regular, the compressed data could vary in final size 
from chunk to chunk (some tidying up might be necessary in the chunk code).
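
For illustration of the end state (each process handing HDF a 
ready-compressed chunk and letting the library do the bookkeeping), here is a 
minimal sketch using the direct chunk-write call that recent HDF5 releases 
expose (H5Dwrite_chunk, HDF5 1.10.2 and later). Whether that call fits the 
interception I describe above is an assumption on my part, and the file and 
dataset names, extents and compression level below are placeholders:

/* Sketch: write one pre-compressed chunk directly, bypassing the filter
 * pipeline.  Assumes `comp_buf` already holds deflate-compressed data of
 * size `comp_size` for the chunk whose first element sits at `offset`
 * (which must lie on a chunk boundary).  In the parallel case the file and
 * dataset would be created collectively rather than per process. */
#include <hdf5.h>

int write_my_chunk(const void *comp_buf, size_t comp_size,
                   const hsize_t offset[2])
{
    hsize_t dims[2]  = {4096, 4096};   /* extents of the whole dataset */
    hsize_t chunk[2] = {512, 512};     /* regular chunk extents */

    hid_t file  = H5Fcreate("whole.h5", H5F_ACC_TRUNC,
                            H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(2, dims, NULL);

    /* Declare the chunk layout and the deflate filter so readers know how
     * to decompress, even though the filter never runs on this write. */
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 2, chunk);
    H5Pset_deflate(dcpl, 6);

    hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_FLOAT, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    /* Filter mask 0 means all declared filters were applied to comp_buf. */
    herr_t status = H5Dwrite_chunk(dset, H5P_DEFAULT, 0, offset,
                                   comp_size, comp_buf);

    H5Dclose(dset);
    H5Pclose(dcpl);
    H5Sclose(space);
    H5Fclose(file);
    return (int)status;
}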

On load, the user can treat the data as a completely normal dataset that just 
happens to be compressed.

I suspect this is what you meant too, but I thought I'd spell it out more 
clearly just in case. I will start poking around with the chunking code to see 
if I can intercept things at convenient places. Please stop me if you think I'm 
pursuing a bad idea.

JB
