Hi All,

I've been learning pHDF5 by developing a toy application that checkpoints
and restores its state. The restore function was the last piece I
implemented, and after doing so I realized I have an issue: since each
process is responsible for strided blocks of data, the blocks saved during
one run may not divide evenly among the processes of another run, because
the mpi_size of the latter run may not evenly divide the total number of
blocks.
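
By "strided blocks" I mean a selection along these lines on the checkpoint
side (a rough sketch with made-up names, not the actual code in
perfectNumbers.c):

/* Sketch of the write-side selection: each of mpi_size ranks owns every
 * mpi_size-th block of block_size elements of a 1-D dataset.
 * total_elems, mpi_rank, mpi_size and filespace are illustrative names. */
hsize_t block_size = 64;                                     /* elements per block            */
hsize_t nblocks    = total_elems / (block_size * mpi_size);  /* blocks this rank writes       */
hsize_t start      = (hsize_t)mpi_rank * block_size;         /* first element this rank owns  */
hsize_t stride     = (hsize_t)mpi_size * block_size;         /* distance between my blocks    */

H5Sselect_hyperslab(filespace, H5S_SELECT_SET,
                    &start, &stride, &nblocks, &block_size);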

I was hoping that a fill value might save me here, so that I would simply
read 0s if I tried to read beyond the end of the dataset. However, I
believe I saw a page noting that this isn't possible for contiguous
datasets.
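
For what it's worth, this is roughly the dataset-creation sketch I had in
mind (names and sizes are made up, and file_id stands for an already-open
file handle). My understanding, though, is that a fill value only covers
allocated-but-unwritten elements within the dataset's extent, and
effectively needs a chunked layout, so it wouldn't help a selection that
runs past the end of the dataset anyway:

#include "hdf5.h"

/* Sketch: create a chunked 1-D dataset with an explicit fill value of 0,
 * so unwritten chunks inside the extent read back as 0.  The dataset
 * name and sizes are illustrative; file_id is an already-open file. */
hid_t create_zero_filled(hid_t file_id)
{
    hsize_t dims[1]  = {1000};
    hsize_t chunk[1] = {100};
    long    fillval  = 0;

    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);                        /* chunked, not contiguous */
    H5Pset_fill_value(dcpl, H5T_NATIVE_LONG, &fillval);

    hid_t dset = H5Dcreate2(file_id, "state", H5T_NATIVE_LONG, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    H5Pclose(dcpl);
    H5Sclose(space);
    return dset;
}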

The good news is that since I'm working with 1-dimensional data, it is
fairly easy to refactor the relevant code.

The run fails with this message:

[brandon@euca-128-84-11-180 pHDF5]$ mpirun -n 2 perfectNumbers

HDF5-DIAG: Error detected in HDF5 (1.8.12) MPI-process 0:
  #000: ../../src/H5Dio.c line 179 in H5Dread(): can't read data
    major: Dataset
    minor: Read failed
  #001: ../../src/H5Dio.c line 446 in H5D__read(): src and dest data spaces
have different sizes
    major: Invalid arguments to routine
    minor: Bad value
perfectNumbers: perfectNumbers.c:382: restore: Assertion `status != -1'
failed.
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 3717 on node euca-128-84-11-180
exited on signal 11 (Segmentation fault).
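
As far as I can tell, #001 is just saying that the number of elements
selected in the file dataspace doesn't match the number of elements in the
memory dataspace I hand to H5Dread. A check like this right before the read
should make the mismatch visible per rank (sketch only; dset, memspace,
filespace, xfer_plist and buf stand in for whatever restore() actually
uses, and I'm assuming a long element type):

/* Sketch: verify the two selections agree before reading. */
hssize_t n_file = H5Sget_select_npoints(filespace);
hssize_t n_mem  = H5Sget_select_npoints(memspace);

if (n_file != n_mem) {
    fprintf(stderr, "rank %d: file selection = %lld points, memory space = %lld points\n",
            mpi_rank, (long long)n_file, (long long)n_mem);
    MPI_Abort(MPI_COMM_WORLD, 1);
}

herr_t status = H5Dread(dset, H5T_NATIVE_LONG, memspace, filespace,
                        xfer_plist, buf);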

Here is the offending line
<https://github.com/cornell-comp-internal/CR-demos/blob/3d7ac426b041956b860a0b83c36b88024a64ac1c/demos/pHDF5/perfectNumbers.c#L380>
in the restore function; you can look at the checkpoint function to see how
things are written out to disk.

General pointers are appreciated as well. To paraphrase the problem more
simply: I have a distributed (strided) array that I write out to disk as a
dataset across n processes, and when I restart the program, I may want to
divvy the data up among m processes into similar data structures as before,
but now m != n. Actually, my problem may be different from just this, since
I seem to get the same issue even when m == n ... hmm.
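
In case it helps, here is the restore-side logic I think I need: query the
extent of the checkpointed dataset, split it across whatever mpi_size the
new run happens to have (low ranks absorb the remainder), and build a file
selection and a memory space that describe the same number of elements.
This is only a sketch with made-up names; it hands each rank one contiguous
piece rather than strided blocks and assumes a long element type:

#include "hdf5.h"
#include <mpi.h>
#include <stdlib.h>

/* Sketch: read back a checkpoint written by n processes using m processes.
 * dset is an open 1-D dataset of longs; names are illustrative. */
long *restore_my_piece(hid_t dset, int mpi_rank, int mpi_size, hsize_t *my_count)
{
    /* How many elements were checkpointed, regardless of who wrote them? */
    hid_t   filespace = H5Dget_space(dset);
    hsize_t total;
    H5Sget_simple_extent_dims(filespace, &total, NULL);

    /* Divide the elements among the current ranks; low ranks take the remainder. */
    hsize_t base  = total / (hsize_t)mpi_size;
    hsize_t rem   = total % (hsize_t)mpi_size;
    hsize_t count = base + ((hsize_t)mpi_rank < rem ? 1 : 0);
    hsize_t start = (hsize_t)mpi_rank * base +
                    ((hsize_t)mpi_rank < rem ? (hsize_t)mpi_rank : rem);

    /* File selection and memory space now have the same number of elements. */
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &start, NULL, &count, NULL);
    hid_t memspace = H5Screate_simple(1, &count, NULL);

    long *buf  = malloc(count * sizeof(long));
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);   /* collective parallel read */

    H5Dread(dset, H5T_NATIVE_LONG, memspace, filespace, dxpl, buf);

    H5Pclose(dxpl);
    H5Sclose(memspace);
    H5Sclose(filespace);
    *my_count = count;
    return buf;
}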

Thanks,
-- 
Brandon E. Barker
http://www.cac.cornell.edu/barker/
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5

Reply via email to