I've been doing some experimenting to try to determine whether there is
an optimal read/write size and, if so, what it is, but I find there are
questions I need answers to that I haven't found yet... Hopefully
someone here can provide them :-)
Our data is going to be stored in datasets consisting of anywhere from
~1K up to ~1M or even ~10M records of a relatively small (~170-byte)
compound type. My read tests suggest that the optimal number of records
to read from the file at once is between 2048 and 4096. The data was
compressed for these tests.
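In case it helps, here's a stripped-down sketch of the kind of read loop
my tests use; the compound type, field names, and file/dataset names
below are just placeholders, not my actual schema:

/* Simplified sketch of the read test: pull the records back in
 * blocks of 4096 via hyperslab selections.  Type and names are
 * placeholders. */
#include "hdf5.h"
#include <stdlib.h>

typedef struct {
    long long id;
    double    values[20];   /* pads the record out to ~170 bytes */
} record_t;

int main(void)
{
    hid_t file   = H5Fopen("test.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t dset   = H5Dopen2(file, "/records", H5P_DEFAULT);
    hid_t fspace = H5Dget_space(dset);

    hsize_t nrecords;
    H5Sget_simple_extent_dims(fspace, &nrecords, NULL);

    /* In-memory compound type matching the struct layout. */
    hsize_t adims = 20;
    hid_t atype = H5Tarray_create2(H5T_NATIVE_DOUBLE, 1, &adims);
    hid_t mtype = H5Tcreate(H5T_COMPOUND, sizeof(record_t));
    H5Tinsert(mtype, "id",     HOFFSET(record_t, id),     H5T_NATIVE_LLONG);
    H5Tinsert(mtype, "values", HOFFSET(record_t, values), atype);

    const hsize_t block = 4096;           /* records per read */
    record_t *buf = malloc(block * sizeof(record_t));

    for (hsize_t start = 0; start < nrecords; start += block) {
        hsize_t count = (nrecords - start < block) ? (nrecords - start) : block;
        hid_t mspace = H5Screate_simple(1, &count, NULL);
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, &start, NULL, &count, NULL);
        H5Dread(dset, mtype, mspace, fspace, H5P_DEFAULT, buf);
        /* ... process 'count' records ... */
        H5Sclose(mspace);
    }

    free(buf);
    H5Tclose(atype); H5Tclose(mtype);
    H5Sclose(fspace); H5Dclose(dset); H5Fclose(file);
    return 0;
}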
What drives that optimal read size? For these tests, only one dataset
was written, so I assume the data was all written contiguously.
Chunking puzzles me. At first, I thought the chunk size was a number of
bytes (I haven't been able to find any documentation that explicitly
says whether it's a number of bytes, a number of records, or something
else), but now I'm not sure. Again, I ran some experiments and found
that there was a bit of extra overhead with a chunk size of 1, but there
really wasn't much difference between chunk sizes of 128, 512, and 2048
(in terms of writing speed, that is; there's definitely a difference in
file size). That said, when I tried the same test with a chunk size of
10240, it slowed down enough that I didn't bother letting it finish.
After playing around a bit more, it seems the largest chunk size I can
pick (in whatever units it happens to be in) that still completes in a
reasonable time frame is 6553; processing time increases by two orders
of magnitude going from 6553 to 6554.
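For reference, here's a stripped-down sketch of how the write test
creates the dataset; again, the compound type, names, and record count
are placeholders, and CHUNK_RECORDS is the setting I've been varying:

/* Simplified sketch of the write test: one chunked, compressed
 * dataset of compound records, written in blocks.  Type, names,
 * and sizes are placeholders. */
#include "hdf5.h"
#include <stdlib.h>

#define NRECORDS      1000000      /* placeholder dataset size       */
#define CHUNK_RECORDS 2048         /* the chunk setting under test   */
#define WRITE_BLOCK   4096         /* records written per H5Dwrite   */

typedef struct {
    long long id;
    double    values[20];          /* pads the record out to ~170 bytes */
} record_t;

int main(void)
{
    /* Compound type used for both the file and the memory buffer. */
    hsize_t adims = 20;
    hid_t atype = H5Tarray_create2(H5T_NATIVE_DOUBLE, 1, &adims);
    hid_t rtype = H5Tcreate(H5T_COMPOUND, sizeof(record_t));
    H5Tinsert(rtype, "id",     HOFFSET(record_t, id),     H5T_NATIVE_LLONG);
    H5Tinsert(rtype, "values", HOFFSET(record_t, values), atype);

    hid_t file = H5Fcreate("write_test.h5", H5F_ACC_TRUNC,
                           H5P_DEFAULT, H5P_DEFAULT);

    hsize_t dims = NRECORDS;
    hid_t fspace = H5Screate_simple(1, &dims, NULL);

    /* Chunking and compression: CHUNK_RECORDS is the value I've been
     * sweeping (1, 128, 512, 2048, 6553, 6554, 10240). */
    hsize_t chunk = CHUNK_RECORDS;
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, &chunk);
    H5Pset_deflate(dcpl, 6);

    hid_t dset = H5Dcreate2(file, "/records", rtype, fspace,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    record_t *buf = calloc(WRITE_BLOCK, sizeof(record_t));
    for (hsize_t start = 0; start < NRECORDS; start += WRITE_BLOCK) {
        hsize_t count = (NRECORDS - start < WRITE_BLOCK)
                            ? (NRECORDS - start) : WRITE_BLOCK;
        hid_t mspace = H5Screate_simple(1, &count, NULL);
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, &start, NULL, &count, NULL);
        H5Dwrite(dset, rtype, mspace, fspace, H5P_DEFAULT, buf);
        H5Sclose(mspace);
    }

    free(buf);
    H5Pclose(dcpl); H5Dclose(dset); H5Sclose(fspace);
    H5Tclose(rtype); H5Tclose(atype); H5Fclose(file);
    return 0;
}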
So what drives the optimal chunk size, if my concerns are 1) reading
quickly and 2) writing quickly, in that order? Obviously the files are a
lot smaller with the larger chunk sizes, but why does the processing
time suddenly skyrocket going from 6553 to 6554? And what units is the
chunk size specified in?
Thanks for any answers!