Hi Jed, 

My first attempt at sanitising chunking.

The rationale is to keep the old behaviour where possible, so I just add an 
if-else after the chunk sizes are first calculated and change only those sizes 
that need to be changed. [There is also a zeroing of the 6-element arrays at 
the beginning of the function: I multiply their entries together to find the 
various sizes, and at least on Cray the stack seems to contain garbage on 
entry to the function, which made the computed sizes nonsensical.]
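As an aside, the zeroing just amounts to initialising the six-element hsize_t 
extent arrays before anything is multiplied together; an initialiser would do 
the same job as the loop in the patch (a sketch only, the real declarations 
live at the top of the function being patched):

  #include <hdf5.h>   /* for hsize_t */

  /* Sketch: zero the six-element extent arrays up front so that the later
     products (chunk sizes, counts) never pick up garbage from the stack. */
  hsize_t dims[6]      = {0}, maxDims[6] = {0}, chunkDims[6] = {0};
  hsize_t count[6]     = {0}, offset[6]  = {0};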

I first check whether the average vector size exceeds 4 GiB. If it does, I 
simply use as few chunks as possible, splitting along the slowest varying axis 
first and, if that is not enough, splitting the next axis too, and so on (see 
the sketch below). The patch *can* split the fastest varying dimension as 
well, but that seems like overkill: it would only matter if da_grid_x*dof > 
4 GiB, which seems unlikely. It cannot split dof.
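Here is a simplified, standalone sketch of that splitting order. It ignores 
the alignment to the process grid that the patch keeps, and the names 
(pick_slices and its arguments) are made up for illustration only:

  #include <math.h>
  #include <stdint.h>

  /* Sketch: choose how many chunk slices to cut along each axis so that each
     chunk stays below max_chunk_size.  Split the slowest varying axis (z)
     first, then y, then x; never split dof.  M, N, P are the global grid
     sizes, dof is the number of components per grid point. */
  static void pick_slices(uint64_t M, uint64_t N, uint64_t P, uint64_t dof,
                          uint64_t max_chunk_size,
                          uint64_t *xs, uint64_t *ys, uint64_t *zs)
  {
    double   vec_size = (double)sizeof(double)*M*N*P*dof;  /* Vec size, bytes */
    uint64_t zsl = (uint64_t)ceil(vec_size/max_chunk_size), ysl = 1, xsl = 1;
    if (zsl > P) {                    /* splitting z alone is not enough */
      zsl = P;
      ysl = (uint64_t)ceil(vec_size/(zsl*(double)max_chunk_size));
      if (ysl > N) {                  /* splitting z and y is not enough */
        ysl = N;
        xsl = (uint64_t)ceil(vec_size/(zsl*ysl*(double)max_chunk_size));
      }
    }
    *zs = zsl; *ys = ysl; *xs = xsl;  /* chunk dims are then P/zsl, N/ysl, M/xsl */
  }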

If the average Vec size is <= 4 GiB but the total Vec size is > 4 GiB, the 
logic is simple: first see whether 
<local_z_size>*<global_y_size>*<global_x_size>*dof (in bytes) is smaller than 
the "target size", which comes from your formula earlier in this thread 
(spelled out in the sketch below). If it is smaller, use that as the chunk 
size; if not, just use the local Vec size as the chunk size. This could be 
improved by adding an intermediate-size option as well, but the current 
version gives *my* tests decent performance across several global sizes, so it 
does not seem too bad.

The performance I get is just ~3 GiB/s, using 36 OSTs on a Lustre file system. 
Using one OST I get ~200 MiB/s, so the scaling is not that bad: I get roughly 
half of 36*200 MiB/s (about 7 GiB/s), so I think the performance is limited by 
the OSTs rather than by MPI-IO, HDF5 or the chunking. Tests with bigger 
systems would of course be needed to determine whether this solution also 
works for truly huge files: the biggest I tried are just over 50 GiB. Because 
I'm forced to spend my research quota on this, I don't want to waste too much 
of it on ridiculously large files (these are about the biggest ones I plan to 
use in practice for now).
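To spell out the target size mentioned above: it just restates your formula, 
with the constants taken from the attached patch (the helper names min3, max3 
and target_chunk_size are only for illustration):

  #include <math.h>

  static double min3(double a, double b, double c) { return fmin(a, fmin(b, c)); }
  static double max3(double a, double b, double c) { return fmax(a, fmax(b, c)); }

  /* Target chunk size in bytes: at least the average local Vec size (or a
     ~10 MiB floor), large enough that the 65536-chunk limit is respected,
     but never larger than 4 GiB or the whole Vec. */
  static double target_chunk_size(double vec_size, double avg_local_vec_size)
  {
    const double max_chunks     = 65536.0;
    const double max_chunk_size = 4294967296.0;  /* 4 GiB */
    const double min_size       = 10245760.0;    /* as in the patch */
    return ceil(min3(vec_size,
                     max3(avg_local_vec_size, ceil(vec_size/max_chunks), min_size),
                     max_chunk_size));
  }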

And the patch itself is attached, of course. Please improve it if you think it 
should be improved, and adapt it to the PETSc coding conventions if necessary 
(at least the #defines are in a funny place, to keep the patch simple).
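
For anyone reading the thread without gr2.c open: the computed chunkDims end 
up in a standard HDF5 dataset-creation property list, roughly like this. This 
is a generic HDF5 sketch, not a quote of the PETSc code, and 
make_chunked_dataset is just an illustrative name:

  #include <hdf5.h>

  /* Generic sketch of how per-dataset chunking is applied in HDF5. */
  static hid_t make_chunked_dataset(hid_t file_id, const char *name, int dim,
                                    const hsize_t dims[], const hsize_t maxDims[],
                                    const hsize_t chunkDims[])
  {
    hid_t filespace = H5Screate_simple(dim, dims, maxDims);
    hid_t dcpl      = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, dim, chunkDims);   /* the sizes computed by the patch */
    hid_t dset = H5Dcreate2(file_id, name, H5T_NATIVE_DOUBLE, filespace,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Pclose(dcpl);
    H5Sclose(filespace);
    return dset;
  }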

Cheers,
Juha

On Sunday 06 Oct 2013 12:24:58 Jed Brown wrote:
> Juha Jäykkä <[email protected]> writes:
> > Argh, messy indeed. Are you sure you mean 65 k and not 64 Ki?
> 
> I was using 65 as shorthand for 2^{16}, yes.
> 
> > I made a small table of the situation just to make sure I am not
> > missing anything. In the table, "small" means < 4 GiB, "large" means >=
> > 4 GiB, "few" means < 65 k, "many" means >= 65 k. Note that local size >
> > global size is impossible, but I include the row in the table for the
> > sake of completeness.
> > 
> > Variables:  local size      global size     # ranks         chunks
> > 
> >             small           small           few             global size
> >             small           small           many            global size[1]
> >             small           large           few             avg local size
> >             small           large           many            4 GiB
> >             large           small           few             impossible
> >             large           small           many            impossible
> >             large           large           few             4 GiB[2]
> >             large           large           many            65 k chunks
> 
> The last line cannot be stored since it consists of more than 65k chunks
> of size larger than 4 GiB.
> 
> > [1] It sounds improbable that anyone would run a problem with < 4 GiB of
> > data on >= 65k ranks, but fortunately it's not a problem.
> 
> Au contraire, it's typical for strong scaling to roll over at about 1000
> dofs/core (about 10 kB of data), so 65k ranks is still less than 1 GB.
> 
> > [2] Unless I'm mistaken, this situation will always give < 65 k chunks for
> > 4 GiB chunk size.
> 
> Well, 33k ranks each trying to write 8 GiB would need 66k chunks, but
> there is no way to write this file so we don't need to worry about it.
> 
> > I also believe your formula gives "the right" answer in each case. Just
> > one
> > more question: is "average local size" a good solution or is it better to
> > use "max local size"? The latter will cause more unnecessary data in the
> > file, but unless I'm mistaken, the former will require extra MPI
> > communication to fill in the portions of ranks whose local size is less
> > than average.
> 
> It depends how the compute nodes are connected to the file system, but
> even if you use "average" and if size is statistically uncorrelated with
> rank but has positive variance, the expected value of the skew will be
> more than a rank's contribution for sufficiently large numbers of cores.
> In other words, it's not possible to "balance" without needing to move
> all data for a sufficiently large number of ranks.
> 
> > HDF5 really needs to fix this internally. As it stands, a single HDF5
> > dataset cannot hold more than 260 TiB – not that many people would want
> > such files anyway, but then again, "640 kiB should be enough for
> > everybody", right? I'm running simulations which take more than terabyte
> > of memory, and I'm by far not the biggest memory consumer in the world,
> > so the limit is not really as far as it might seem.
> 
> We're already there.  Today's large machines can fit several vectors of
> size 260 TB in memory.
> 
> >> I think we're planning to tag 3.4.3 in the next couple weeks.  There
> >> might be a 3.4.4 as well, but I could see going straight to 3.5.
> > 
> > Ok. I don't see myself having time to fix and test this in two weeks, but
> > 3.4.4 should be doable. Anyone else want to fix the bug by then?
> 
> I'll write back if I get to it.
-- 
                 -----------------------------------------------
                | Juha Jäykkä, [email protected]                     |
                | http://koti.kapsi.fi/~juhaj/                  |
                 -----------------------------------------------
--- gr2.c.orig	2013-05-14 03:17:51.000000000 +0300
+++ gr2.c	2013-10-21 19:14:11.000000000 +0300
@@ -341,6 +341,15 @@
   const char     *vecname;
   PetscErrorCode ierr;
 
+  // just in case some libraries do not empty memory for local variables:
+  for (dim=0;dim<6;dim++) {
+    chunkDims[dim] = 0;
+    maxDims[dim] = 0;
+    dims[dim] = 0;
+    count[dim] = 0;
+    offset[dim] = 0;
+  }
+
   PetscFunctionBegin;
   ierr = PetscViewerHDF5OpenGroup(viewer, &file_id, &group);CHKERRQ(ierr);
   ierr = PetscViewerHDF5GetTimestep(viewer, &timestep);CHKERRQ(ierr);
@@ -395,6 +405,103 @@
   chunkDims[dim] = dims[dim];
   ++dim;
 #endif
+
+  // some requirements for the chunks (assuming no split along time-step)
+  int ranks;
+  MPI_Comm_size(PETSC_COMM_WORLD, &ranks);
+  hsize_t chunk_size, target_size, dim_save,
+    vec_size = sizeof(double)*da->M*da->N*da->P*da->w,
+    avg_local_vec_size = (hsize_t) ceil(vec_size*1.0/ranks),
+    max_chunks = 65536,
+    max_chunk_size = 4294967296,
+    GiB = 1073741824,
+    MiB = 1048576,
+    min_size = 10245760,
+    zslices=da->p, yslices=da->n, xslices=da->m;
+
+#define fmin3(a,b,c)	fmin(a,fmin(b,c))
+#define fmax3(a,b,c)	fmax(a,fmax(b,c))
+#define max(a,b)        (a>=b?a:b)
+  
+  dim_save=dim;
+  target_size = (hsize_t) ceil(fmin3(vec_size, fmax3(avg_local_vec_size, ceil(vec_size*1.0/max_chunks), min_size), max_chunk_size));
+  chunk_size = (hsize_t) max(1,chunkDims[0])*max(1,chunkDims[1])*max(1,chunkDims[2])*max(1,chunkDims[3])*max(1,chunkDims[4])*max(1,chunkDims[5])*sizeof(PetscScalar);
+
+  // if size/rank > max_chunk_size, we need radical measures: even going down to
+  // avg_local_vec_size is not enough, so we simply use chunk size of 4 GiB no matter
+  // what, composed in the most efficient way possible.
+  // N.B. this minimises the number of chunks, which may or may not be the optimal
+  // solution. In a BG, for example, the optimal solution is probably to make # chunks = #
+  // IO nodes involved, but this author has no access to a BG to figure out how to
+  // reliably find the right number. And even then it may or may not be enough.
+  if (avg_local_vec_size > max_chunk_size) {
+    // check if we can just split local z-axis: is that enough?
+    zslices=(hsize_t) ceil(vec_size*1.0/(da->p*max_chunk_size))*zslices;
+    if (zslices > da->P) {
+      // lattice is too large in xy-directions, splitting z only is not enough
+      zslices = da->P;
+      yslices=(hsize_t) ceil(vec_size*1.0/(zslices*da->n*max_chunk_size))*yslices;
+      if (yslices > da->N) {
+	// lattice is too large in x-direction, splitting along z, y is not enough
+	yslices = da->N;
+	xslices=(hsize_t) ceil(vec_size*1.0/(zslices*yslices*da->m*max_chunk_size))*xslices;
+      }
+    }
+    dim = 0;
+    if (timestep >= 0) {
+      ++dim;
+    }
+    // prefer to split z-axis, even down to planar slices
+    if (da->dim == 3) {
+      chunkDims[dim++] = (hsize_t) floor(da->P*1.0/zslices);
+      chunkDims[dim++] = (hsize_t) floor(da->N*1.0/yslices);
+      chunkDims[dim++] = (hsize_t) floor(da->M*1.0/xslices);
+    } else {
+      // This is a 2D world exceeding 4GiB in size; yes, I've seen them, even used myself
+      chunkDims[dim++] = (hsize_t) floor(da->N*1.0/yslices);
+      chunkDims[dim++] = (hsize_t) floor(da->M*1.0/xslices);
+    }
+    chunk_size = (hsize_t) fmax(1,chunkDims[0])*fmax(1,chunkDims[1])*fmax(1,chunkDims[2])*fmax(1,chunkDims[3])*fmax(1,chunkDims[4])*fmax(1,chunkDims[5])*sizeof(double);
+  } else {
+    if (target_size < chunk_size) {
+      // only change the defaults if target_size < chunk_size
+      dim = 0;
+      if (timestep >= 0) {
+	++dim;
+      }
+      // prefer to split z-axis, even down to planar slices
+      if (da->dim == 3) {
+	// try splitting the z-axis to core-size bits, i.e. divide chunk size by # ranks in z-direction
+	if (target_size >= chunk_size/da->p) {
+	  // just make chunks the size of <local_z>x<whole_world_y>x<whole_world_x>x<dof>
+	  chunkDims[dim] = (hsize_t) ceil(da->P*1.0/da->p);
+	} else {
+	  // oops, just splitting the z-axis is NOT ENOUGH, need to split more; let's be
+	  // radical and let everyone write all they've got
+	  chunkDims[dim++] = (hsize_t) ceil(da->P*1.0/da->p);
+	  chunkDims[dim++] = (hsize_t) ceil(da->N*1.0/da->n);
+	  chunkDims[dim++] = (hsize_t) ceil(da->M*1.0/da->m);
+	}
+      } else {
+	// This is a 2D world exceeding 4GiB in size; yes, I've seen them, even used myself
+	if (target_size >= chunk_size/da->n) {
+	  // just make chunks the size of <local_z>x<whole_world_y>x<whole_world_x>x<dof>
+	  chunkDims[dim] = (hsize_t) ceil(da->N*1.0/da->n);
+	} else {
+	  // oops, just splitting the z-axis is NOT ENOUGH, need to split more; let's be
+	  // radical and let everyone write all they've got
+	  chunkDims[dim++] = (hsize_t) ceil(da->N*1.0/da->n);
+	  chunkDims[dim++] = (hsize_t) ceil(da->M*1.0/da->m);
+	}
+      
+      }
+      chunk_size = (hsize_t) fmax(1,chunkDims[0])*fmax(1,chunkDims[1])*fmax(1,chunkDims[2])*fmax(1,chunkDims[3])*fmax(1,chunkDims[4])*fmax(1,chunkDims[5])*sizeof(double);
+    } else {
+      // precomputed chunks are fine, we don't need to do anything
+    }
+  }
+  dim=dim_save; // need to have the original value here no matter what!
+
   filespace = H5Screate_simple(dim, dims, maxDims);
   if (filespace == -1) SETERRQ(PETSC_COMM_SELF,PETSC_ERR_LIB,"Cannot H5Screate_simple()");
 
