Hi Jed,

Here is my first attempt at sanitising the chunking.
The rationale is to keep the old behaviour where possible, so I just add an
if-else after the chunk sizes are first calculated and change only those
sizes that need to be changed. [There is also a zeroing of the 6-element
arrays at the beginning of the function, because I multiply them together to
find the various sizes, and at least Cray seems to leave random numbers on
the stack when entering the function, causing the sizes to be crazy.]

I first check whether the average vector size exceeds 4 GiB. If it does, I
simply use as few chunks as possible, splitting along the slowest-varying
axis first and, if that is not enough, splitting the next axis too, and so
on. The patch *can* split the fastest-varying dimension as well, but that
seems overkill: it would mean that da_grid_x*dof > 4 GiB, which seems
unlikely. It cannot split dof.

If the average Vec size is <= 4 GiB but the total Vec size is > 4 GiB, the
logic is simple: first see whether <local_z_size>*<global_y_size>*
<global_x_size>*dof is smaller than the "target size". The target size comes
from your formula earlier in this thread. If it is smaller, use that; if
not, just use the local Vec size as the chunk size. This could be improved
by having an intermediate-size solution, too, but this solution gives *my*
tests decent performance across several global sizes, so it does not seem
to be too bad.

Now, the performance I get is just ~3 GiB/s, using 36 OSTs on a Lustre file
system. Using one OST I get ~200 MiB/s, so the scaling is not that bad: I
get about half of 36*200 MiB/s, so I think my performance is limited by the
OSTs, not by MPI-IO, HDF5, or chunking. Tests with bigger systems would of
course be needed to determine whether this solution works for truly huge
files, too: the biggest I tried are just over 50 GiB. Because I'm forced to
spend my research quota on this, I don't want to waste too much of it by
going to ridiculously large files (these are about the biggest ones I plan
to use in practice for now).

And the patch itself is attached, of course; a standalone sketch of the
selection logic follows the quoted thread below. Please improve the patch
if you think it should be improved, and adapt it to PETSc coding conventions
if necessary (at least the #defines are in a funny place, to keep the patch
simple).

Cheers,
Juha

On Sunday 06 Oct 2013 12:24:58 Jed Brown wrote:
> Juha Jäykkä <[email protected]> writes:
> > Argh, messy indeed. Are you sure you mean 65 k and not 64 Ki?
>
> I was using 65 as shorthand for 2^16, yes.
>
> > I made a small table of the situation just to make sure I am not
> > missing anything. In the table, "small" means < 4 GiB, "large" means
> > >= 4 GiB, "few" means < 65 k, "many" means >= 65 k. Note that
> > local size > global size is impossible, but I include the rows in the
> > table for completeness's sake.
> >
> > Variables:  local size   global size   # ranks   chunks
> >
> >             small        small         few       global size
> >             small        small         many      global size [1]
> >             small        large         few       avg local size
> >             small        large         many      4 GiB
> >             large        small         few       impossible
> >             large        small         many      impossible
> >             large        large         few       4 GiB [2]
> >             large        large         many      65 k chunks
>
> The last line cannot be stored, since it consists of more than 65 k
> chunks of size larger than 4 GiB.
>
> > [1] It sounds improbable anyone would run a problem with < 4 GiB data
> > with >= 65 k ranks, but fortunately it's not a problem.
>
> Au contraire, it's typical for strong scaling to roll over at about 1000
> dofs/core (about 10 kB of data), so 65 k ranks is still less than 1 GB.
>
> > [2] Unless I'm mistaken, this situation will always give < 65 k chunks
> > for a 4 GiB chunk size.
> Well, 33 k ranks each trying to write 8 GiB would need 66 k chunks, but
> there is no way to write this file, so we don't need to worry about it.
>
> > I also believe your formula gives "the right" answer in each case. Just
> > one more question: is "average local size" a good solution, or is it
> > better to use "max local size"? The latter will cause more unnecessary
> > data in the file, but unless I'm mistaken, the former will require
> > extra MPI communication to fill in the portions of ranks whose local
> > size is less than average.
>
> It depends how the compute nodes are connected to the file system, but
> even if you use "average", and if size is statistically uncorrelated with
> rank but has positive variance, the expected value of the skew will be
> more than a rank's contribution for sufficiently large numbers of cores.
> In other words, it's not possible to "balance" without needing to move
> all data for a sufficiently large number of ranks.
>
> > HDF5 really needs to fix this internally. As it stands, a single HDF5
> > dataset cannot hold more than 260 TiB – not that many people would want
> > such files anyway, but then again, "640 kiB should be enough for
> > everybody", right? I'm running simulations which take more than a
> > terabyte of memory, and I'm by far not the biggest memory consumer in
> > the world, so the limit is not really as far away as it might seem.
>
> We're already there. Today's large machines can fit several vectors of
> size 260 TB in memory.
>
> >> I think we're planning to tag 3.4.3 in the next couple weeks. There
> >> might be a 3.4.4 as well, but I could see going straight to 3.5.
> >
> > Ok. I don't see myself having time to fix and test this in two weeks,
> > but 3.4.4 should be doable. Anyone else want to fix the bug by then?
>
> I'll write back if I get to it.

-- 
-----------------------------------------------
| Juha Jäykkä, [email protected]              |
| http://koti.kapsi.fi/~juhaj/                 |
-----------------------------------------------
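To make the selection logic easier to check outside the diff context, here
is a minimal standalone sketch of the same decision tree. It is plain C,
not PETSc code: the grid, dof, and process-grid numbers are made-up
examples, while the real patch reads them from the DMDA fields (da->M,
da->N, da->P, da->w and da->m, da->n, da->p).

/* chunk_logic.c: standalone sketch of the chunk-size selection.
   Sizes are hypothetical; compile with: cc chunk_logic.c -lm */
#include <stdio.h>
#include <math.h>

int main(void)
{
  /* hypothetical 1024^3 grid with 2 dofs on a 4x4x8 process grid */
  double M = 1024, N = 1024, P = 1024, dof = 2; /* global grid and dof   */
  double m = 4, n = 4, p = 8;                   /* ranks along each axis */
  double ranks = m*n*p;

  double max_chunks     = 65536;        /* 2^16: limit on number of chunks   */
  double max_chunk_size = 4294967296.0; /* 2^32 bytes = 4 GiB chunk limit    */
  double min_size       = 10485760;     /* 10 MiB floor, as in the patch     */

  double vec_size   = 8*M*N*P*dof;      /* sizeof(double) = 8                */
  double avg_local  = ceil(vec_size/ranks);
  double chunk_size = vec_size;         /* default chunk = whole dataset     */

  /* target size = min(vec_size, max(avg_local, vec_size/max_chunks,
     min_size), 4 GiB), i.e. the target_size expression in the patch */
  double target_size = fmin(vec_size,
                            fmin(fmax(avg_local,
                                      fmax(ceil(vec_size/max_chunks), min_size)),
                                 max_chunk_size));

  if (avg_local > max_chunk_size) {
    /* radical case: as few chunks as possible, splitting z, then y, then x */
    printf("split slowest axes: ~%.0f chunks of <= 4 GiB\n",
           ceil(vec_size/max_chunk_size));
  } else if (target_size < chunk_size) {
    if (target_size >= chunk_size/p) {
      /* a <local_z> x <global_y> x <global_x> x dof slab fits the target */
      printf("chunk = local_z x global_y x global_x x dof = %.0f bytes\n",
             chunk_size/p);
    } else {
      /* fall back to the local Vec size */
      printf("chunk = local Vec size = %.0f bytes\n",
             ceil(P/p)*ceil(N/n)*ceil(M/m)*dof*8);
    }
  } else {
    printf("default chunking is fine\n"); /* keep the old behaviour */
  }
  printf("target_size = %.0f bytes\n", target_size);
  return 0;
}

With these example numbers the sketch ends up in the local-Vec-size branch:
the 16 GiB vector on 128 ranks gives a 128 MiB target, and a z-slab of
local depth (2 GiB) overshoots it.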
--- gr2.c.orig 2013-05-14 03:17:51.000000000 +0300
+++ gr2.c 2013-10-21 19:14:11.000000000 +0300
@@ -341,6 +341,15 @@
const char *vecname;
PetscErrorCode ierr;
PetscFunctionBegin;
+  /* Zero the six-element size arrays: their entries are multiplied together
+     below, and at least Cray seems to leave garbage on the stack on entry. */
+  for (dim=0; dim<6; dim++) {
+    chunkDims[dim] = 0;
+    maxDims[dim]   = 0;
+    dims[dim]      = 0;
+    count[dim]     = 0;
+    offset[dim]    = 0;
+  }
ierr = PetscViewerHDF5OpenGroup(viewer, &file_id, &group);CHKERRQ(ierr);
ierr = PetscViewerHDF5GetTimestep(viewer, &timestep);CHKERRQ(ierr);
@@ -395,6 +405,103 @@
chunkDims[dim] = dims[dim];
++dim;
#endif
+
+ // some requirements for the chunks (assuming no split along time-step)
+ int ranks;
+ MPI_Comm_size(PETSC_COMM_WORLD, &ranks);
+ hsize_t chunk_size, target_size, dim_save,
+ vec_size = sizeof(PetscScalar)*da->M*da->N*da->P*da->w,
+ avg_local_vec_size = (hsize_t) ceil(vec_size*1.0/ranks),
+ max_chunks = 65536,
+ max_chunk_size = 4294967296,
+ GiB = 1073741824,
+ MiB = 1048576,
+ min_size = 10*MiB,
+ zslices=da->p, yslices=da->n, xslices=da->m;
+
+#define fmin3(a,b,c) fmin(a,fmin(b,c))
+#define fmax3(a,b,c) fmax(a,fmax(b,c))
+#define max(a,b) ((a)>=(b)?(a):(b))
+
+ dim_save=dim;
+ target_size = (hsize_t) ceil(fmin3(vec_size, fmax3(avg_local_vec_size, ceil(vec_size*1.0/max_chunks), min_size), max_chunk_size));
+ chunk_size = (hsize_t) max(1,chunkDims[0])*max(1,chunkDims[1])*max(1,chunkDims[2])*max(1,chunkDims[3])*max(1,chunkDims[4])*max(1,chunkDims[5])*sizeof(PetscScalar);
+
+ // if size/rank > max_chunk_size, we need radical measures: even going down to
+ // avg_local_vec_size is not enough, so we simply use chunk size of 4 GiB no matter
+ // what, composed in the most efficient way possible.
+ // N.B. this minimises the number of chunks, which may or may not be the optimal
+ // solution. In a BG, for example, the optimal solution is probably to make # chunks = #
+ // IO nodes involved, but this author has no access to a BG to figure out how to
+ // reliably find the right number. And even then it may or may not be enough.
+ if (avg_local_vec_size > max_chunk_size) {
+ // check if we can just split local z-axis: is that enough?
+ zslices=(hsize_t) ceil(vec_size*1.0/(da->p*max_chunk_size))*zslices;
+ if (zslices > da->P) {
+ // lattice is too large in xy-directions, splitting z only is not enough
+ zslices = da->P;
+ yslices=(hsize_t) ceil(vec_size*1.0/(zslices*da->n*max_chunk_size))*yslices;
+ if (yslices > da->N) {
+ // lattice is too large in x-direction, splitting along z, y is not enough
+ yslices = da->N;
+ xslices=(hsize_t) ceil(vec_size*1.0/(zslices*yslices*da->m*max_chunk_size))*xslices;
+ }
+ }
+ dim = 0;
+ if (timestep >= 0) {
+ ++dim;
+ }
+ // prefer to split z-axis, even down to planar slices
+ if (da->dim == 3) {
+ chunkDims[dim++] = (hsize_t) floor(da->P*1.0/zslices);
+ chunkDims[dim++] = (hsize_t) floor(da->N*1.0/yslices);
+ chunkDims[dim++] = (hsize_t) floor(da->M*1.0/xslices);
+ } else {
+ // This is a 2D world exceeding 4GiB in size; yes, I've seen them, even used myself
+ chunkDims[dim++] = (hsize_t) floor(da->N*1.0/yslices);
+ chunkDims[dim++] = (hsize_t) floor(da->M*1.0/xslices);
+ }
+ chunk_size = (hsize_t) max(1,chunkDims[0])*max(1,chunkDims[1])*max(1,chunkDims[2])*max(1,chunkDims[3])*max(1,chunkDims[4])*max(1,chunkDims[5])*sizeof(PetscScalar);
+ } else {
+ if (target_size < chunk_size) {
+ // only change the defaults if target_size < chunk_size
+ dim = 0;
+ if (timestep >= 0) {
+ ++dim;
+ }
+ // prefer to split z-axis, even down to planar slices
+ if (da->dim == 3) {
+ // try splitting the z-axis to core-size bits, i.e. divide chunk size by # ranks in z-direction
+ if (target_size >= chunk_size/da->p) {
+ // just make chunks the size of <local_z>x<whole_world_y>x<whole_world_x>x<dof>
+ chunkDims[dim] = (hsize_t) ceil(da->P*1.0/da->p);
+ } else {
+ // oops, just splitting the z-axis is NOT ENOUGH, need to split more; let's be
+ // radical and let everyone write all they've got
+ chunkDims[dim++] = (hsize_t) ceil(da->P*1.0/da->p);
+ chunkDims[dim++] = (hsize_t) ceil(da->N*1.0/da->n);
+ chunkDims[dim++] = (hsize_t) ceil(da->M*1.0/da->m);
+ }
+ } else {
+ // This is a 2D world exceeding 4GiB in size; yes, I've seen them, even used myself
+ if (target_size >= chunk_size/da->n) {
+ // just make chunks the size of <local_y>x<whole_world_x>x<dof>
+ chunkDims[dim] = (hsize_t) ceil(da->N*1.0/da->n);
+ } else {
+ // oops, just splitting the y-axis is NOT ENOUGH, need to split more; let's be
+ // radical and let everyone write all they've got
+ chunkDims[dim++] = (hsize_t) ceil(da->N*1.0/da->n);
+ chunkDims[dim++] = (hsize_t) ceil(da->M*1.0/da->m);
+ }
+
+ }
+ chunk_size = (hsize_t) max(1,chunkDims[0])*max(1,chunkDims[1])*max(1,chunkDims[2])*max(1,chunkDims[3])*max(1,chunkDims[4])*max(1,chunkDims[5])*sizeof(PetscScalar);
+ } else {
+ // precomputed chunks are fine, we don't need to do anything
+ }
+ }
+ dim=dim_save; // need to have the original value here no matter what!
+
filespace = H5Screate_simple(dim, dims, maxDims);
if (filespace == -1) SETERRQ(PETSC_COMM_SELF,PETSC_ERR_LIB,"Cannot H5Screate_simple()");
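In gr2.c the computed chunkDims eventually reach H5Pset_chunk on the
dataset-creation property list; that part of the file is outside this hunk.
For readers who want to see the HDF5 side in isolation, here is a minimal,
non-PETSc illustration (file and dataset names are made up, and the 64^3
grid is just an example) of a z-only split into four chunks:

/* chunked_write.c: how chunk dimensions like chunkDims enter HDF5.
   Compile with: cc chunked_write.c -lhdf5 */
#include <hdf5.h>

int main(void)
{
  hsize_t dims[3]      = {64, 64, 64};  /* whole dataset, slowest axis first */
  hsize_t chunkDims[3] = {16, 64, 64};  /* split z into four slabs           */

  hid_t file  = H5Fcreate("chunked.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
  hid_t space = H5Screate_simple(3, dims, NULL);
  hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);

  /* the computed chunk dimensions are applied here */
  H5Pset_chunk(dcpl, 3, chunkDims);

  hid_t dset = H5Dcreate2(file, "Vec", H5T_NATIVE_DOUBLE, space,
                          H5P_DEFAULT, dcpl, H5P_DEFAULT);

  H5Dclose(dset);
  H5Pclose(dcpl);
  H5Sclose(space);
  H5Fclose(file);
  return 0;
}

Each chunk here is 16*64*64*8 B = 512 KiB, comfortably inside the 4 GiB
per-chunk and 65536-chunk limits discussed in the thread above, which are
exactly the limits the patch is working around.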