Hello,

I have been reading some posts in this forum about chunking in HDF5 and ways of 
optimizing reads of large datasets. The emphasis is on read-time optimization 
rather than write-time optimization because of the way the code works. I have a 
large 3D dataset of native double datatype. Its dimensions are m by n by p (in 
row-major storage), where m is the number of time steps, n is the number of 
grid points, and p is the number of field quantity components (always fixed).

The first access use case is as follows:

1.       The graphics visualization requests data for the whole grid at a given 
time index and component index. The best case here would seem to be reading a 
single hyperslab with a stride equal to the number of field components along 
the 3rd dimension and equal to 1 along the grid point dimension, with a count 
of 1 along the time and field component dimensions.

2.       When the time index changes, the start value for the 1st dimension 
(time index) changes while all other parameters stay constant.

3.       When the field component changes, the start value for the 3rd 
dimension (field component index) changes.

The second use case is as follows:

1.       The algorithm chooses certain grid points of interest (not necessarily 
adjacent in memory), and the code requests a "time history" for those grid 
points, which includes all field components at each point. So the request is 
for data over all time indices and all components over a small subset (a 
negligible fraction of the total) of grid points. The best case here would seem 
to be to read a union of hyperslabs, where the union is over 2nd-dimension 
indices.

2.       Step 1 is repeated many more times for various locations in the grid. 
Since multiple threads are running and making these requests independently, 
there is no possibility of further unions of hyperslabs.

The two read access patterns would seem to have somewhat conflicting needs, 
since the first pattern holds the 1st dimension constant whereas the second 
holds the 2nd dimension constant for each hyperslab. If pushed to make a 
choice, I would optimize the second read pattern over the first, since it 
critically affects execution time. I also intend to use the H5S_SELECT_OR 
selection operator to build the union of hyperslabs before the H5Dread call. 
As previously mentioned, write time is not very critical but read access time 
is. So my questions are:


1.       What chunk sizes would work best? I am planning to use an m by 1 by p 
chunk size when writing the dataset, provided m is large enough to push the 
chunk size over 1 MB. If it is smaller, I would increase the 2nd dimension of 
the chunk. Is this the right strategy?

2.       Can I set the cache size for the dataset to optimize read time? I 
read about using H5Pset_chunk_cache. Since I know how many bytes each chunk I 
request will be, should I set the cache size to the number of grid points 
times the data size per grid point? Also, is this setting needed only during 
reads (since write-time optimization is not an issue)?

3.       What compression method, if any, should be used? I read in a tutorial 
on chunking 
(https://www.hdfgroup.org/HDF5/doc/Advanced/Chunking/Chunking_Tutorial_EOS13_2009.pdf)
 that if compression is used, the cache size does not matter. Is this correct? 
Why? I did not understand the tutorial's explanation that, since the entire 
chunk is always read (when compression is used) for each H5Dread call, the 
cache size does not matter. Any clarification on how compression helps 
optimize read time would be helpful.

4.       Also, while writing the dataset, can I force the chunks to be laid 
out contiguously in the file to reduce seek times?

Thank you,
Vikram
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5