I've been experimenting to determine whether there is an optimal read/write size and, if so, what it is, but I have some questions I haven't been able to find answers to yet... Hopefully someone here can provide them :-)

Our data is going to be in datasets that consist of anywhere from ~1K through ~1M up to ~10M records of a compound type, each relatively small (~170 bytes)... My read tests suggest that the optimal number of records to read from the file at once is between 2048 and 4096... The data was compressed for these tests.

What drives that optimal read size? For these tests, only one dataset was written, so I assume that the data was all contiguous.

Chunking puzzles me. At first, I thought the chunk size was a number of bytes (I haven't been able to find any documentation that explicitly says whether it's a number of bytes, a number of records, or something else), but now I'm not sure. Again, I did some experiments and found that there was a bit of extra overhead with a chunk size of 1, but there really wasn't much difference between chunk sizes of 128, 512, and 2048. That's in terms of writing speed, mind you; there's definitely a difference in file size. That said, when I tried the same test using a chunk size of 10240, it slowed down enough that I didn't bother letting it finish. After playing a bit more, it seems the largest chunk size I can pick (in whatever units it happens to be in) and still have the test complete in a reasonable time frame is 6553: processing time increases by two orders of magnitude going from 6553 to 6554.

So what drives the optimal chunk size, if your concerns are 1) reading quickly and 2) writing quickly, in that order? Obviously, the files are a lot smaller with the larger chunk sizes, but why does the processing time suddenly skyrocket going from 6553 to 6554? What units is the chunk size specified in?
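
One bit of arithmetic that may be a clue. This is only a sketch, and it assumes the chunk size is a count of records rather than a count of bytes, which is exactly the thing I haven't been able to confirm. If each record actually occupies 160 bytes (a guess on my part; all I know is that they are roughly 170), then 6553 records is the largest chunk that still fits in 1 MiB and 6554 is the first that does not. Whether a 1 MiB limit is really what I'm hitting (I believe that is the default size of the library's chunk cache) is also an assumption. The helper names below are made up for illustration.

```python
# Assumption: chunk size is a number of records (dataset elements), not bytes.
ONE_MIB = 1024 * 1024  # 1,048,576 bytes; assumed limit, e.g. a chunk cache


def chunk_bytes(records_per_chunk, record_size):
    """Uncompressed size in bytes of one chunk of whole records."""
    return records_per_chunk * record_size


def largest_chunk_within(limit_bytes, record_size):
    """Largest number of whole records whose total size fits in limit_bytes."""
    return limit_bytes // record_size


if __name__ == "__main__":
    # 160 is a guessed record size; 170 is my rough estimate from above.
    for record_size in (160, 170):
        fit = largest_chunk_within(ONE_MIB, record_size)
        print(
            f"record size {record_size}: largest chunk under 1 MiB is "
            f"{fit} records ({chunk_bytes(fit, record_size)} bytes); "
            f"{fit + 1} records is {chunk_bytes(fit + 1, record_size)} bytes"
        )
```

If that is the right explanation, then the cutoff should move when the record size changes, which would be easy enough to test.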

Thanks for any answers!

