I am encountering a sudden drop in I/O bandwidth when the number of datasets
in a single group exceeds roughly 1.7 million. Below I describe the issue in
more detail.

I'm converting adaptive mesh refinement data to HDF5 format. Each dataset
contains a small 4-D array of ~10 KB, stored with the compact layout. All
datasets are placed in the same group. When the total number of datasets (N)
is below ~1.7 million, I get an I/O bandwidth of ~100 MB/s, which is
acceptable. However, once N exceeds ~1.7 million, the bandwidth suddenly
drops by one to two orders of magnitude.
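
Here is a minimal sketch of the write pattern. The file name, group name,
dataset names, and array dimensions below are placeholders rather than the
ones used by the actual converter:

```c
#include <stdio.h>
#include <hdf5.h>

int main(void)
{
    /* 8 x 8 x 8 x 5 floats = 10240 bytes ~ 10 KB per dataset (placeholder dims) */
    const hsize_t dims[4] = {8, 8, 8, 5};
    const long    ndsets  = 2000000;        /* beyond the ~1.7 million threshold */
    float         data[8 * 8 * 8 * 5] = {0};
    char          name[32];

    hid_t file  = H5Fcreate("amr_test.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t group = H5Gcreate2(file, "Data", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(4, dims, NULL);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_layout(dcpl, H5D_COMPACT);       /* compact layout, as in the converter */

    /* create all datasets in the single group "Data" */
    for (long n = 0; n < ndsets; n++) {
        snprintf(name, sizeof(name), "Patch_%07ld", n);
        hid_t dset = H5Dcreate2(group, name, H5T_NATIVE_FLOAT, space,
                                H5P_DEFAULT, dcpl, H5P_DEFAULT);
        H5Dwrite(dset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);
        H5Dclose(dset);
    }

    H5Pclose(dcpl);
    H5Sclose(space);
    H5Gclose(group);
    H5Fclose(file);
    return 0;
}
```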

This issue seems to be related to the **number of datasets per group** rather
than the total data size. For example, if I reduce the size of each dataset
by a factor of 5 (to ~2 KB per dataset), the I/O bandwidth still drops when
N > ~1.7 million, even though the total data size is reduced by a factor of 5.

So I was wondering what causes this issue, and whether there is a simple
solution. Since the data stored in different datasets are independent of one
another, I would prefer not to combine them into a single larger dataset. My
current workaround is to create several HDF5 sub-groups under the main group
and distribute the datasets evenly among them, so that the number of datasets
per group stays small; see the sketch below. With this approach the I/O
bandwidth remains stable even when N > 1.7 million.
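
The workaround looks roughly like the following fragment, which replaces the
dataset-creation loop in the sketch above and reuses its handles (group,
space, dcpl, data, name, ndsets); the number of sub-groups and the
round-robin mapping are arbitrary choices:

```c
/* Workaround sketch: spread the datasets over several sub-groups so that no
 * single group accumulates millions of links.  The sub-group count and the
 * naming scheme ("Sub_%03d") are placeholder choices. */
enum { NSUBGROUPS = 100 };

hid_t sub[NSUBGROUPS];
char  gname[16];

for (int g = 0; g < NSUBGROUPS; g++) {
    snprintf(gname, sizeof(gname), "Sub_%03d", g);
    sub[g] = H5Gcreate2(group, gname, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
}

/* round-robin: dataset n goes into sub-group n % NSUBGROUPS */
for (long n = 0; n < ndsets; n++) {
    snprintf(name, sizeof(name), "Patch_%07ld", n);
    hid_t dset = H5Dcreate2(sub[n % NSUBGROUPS], name, H5T_NATIVE_FLOAT, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);
    H5Dclose(dset);
}

for (int g = 0; g < NSUBGROUPS; g++)
    H5Gclose(sub[g]);
```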

If necessary, I can post complete, self-contained code that reproduces this
issue.

Hsi-Yu