Re: [Hdf-forum] Strategy for pHDF5 collective reads/writes on variable sized communicators

Brandon Barker Thu, 28 May 2015 07:29:27 -0700

Hi All,

I have fixed (and pushed the fix for) one bug related to an improperly
defined count in the restore function. I still have issues when m != n:


  #000: ../../src/H5Dio.c line 158 in H5Dread(): selection+offset not within extent
    major: Dataspace
    minor: Out of range

I believe this indicates that I need to use chunked datasets so that
my dataset can grow in size dynamically.
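
For reference, a minimal sketch of how each rank's strided selection could be
sized so that it never runs past the end of the dataset, whatever the
communicator size. The names strided_sel, rank_selection and selection_end
are hypothetical and do not come from perfectNumbers.c; the sketch assumes
fixed-length blocks dealt out round-robin by rank. The four fields correspond
to the start, stride, count and block arguments of H5Sselect_hyperslab for a
1-dimensional dataspace.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical selection parameters, in units of dataset elements. */
typedef struct {
    size_t start;   /* offset of this rank's first block  */
    size_t stride;  /* distance between consecutive blocks */
    size_t count;   /* number of blocks this rank owns     */
    size_t block;   /* elements per block                  */
} strided_sel;

/* Assumption: block b (0-based) belongs to rank b % nranks. A rank owns
 * every block index r, r + m, r + 2m, ... that is < nblocks, so count is
 * clamped to what actually exists in the file. */
static strided_sel rank_selection(size_t nblocks, size_t block_len,
                                  int nranks, int rank)
{
    assert(block_len > 0 && nranks > 0 && rank >= 0 && rank < nranks);
    size_t m = (size_t)nranks;
    size_t r = (size_t)rank;
    strided_sel s;
    s.start  = r * block_len;
    s.stride = m * block_len;
    s.block  = block_len;
    s.count  = (r < nblocks) ? (nblocks - 1 - r) / m + 1 : 0;
    return s;
}

/* One past the last element touched by the selection (0 if it is empty).
 * This must not exceed nblocks * block_len, the dataset extent. */
static size_t selection_end(strided_sel s)
{
    if (s.count == 0)
        return 0;
    return s.start + (s.count - 1) * s.stride + s.block;
}
```

With 5 blocks of 4 elements and 2 ranks, rank 0 gets 3 blocks and rank 1 gets
2, and neither selection extends past element 20. The memory dataspace passed
to H5Dread needs the same number of elements as the file selection (count *
block). In a collective read, a rank whose count is 0 still has to make the
call, typically with an empty selection such as H5Sselect_none.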

On Wed, May 27, 2015 at 5:03 PM, Brandon Barker <[email protected]>
wrote:

> Hi All,
>
> I've been learning pHDF5 by way of developing a toy application that
> checkpoints and restores its state. The restore function was the last to be
> implemented, but I realized after doing so that I have an issue: since each
> process has strided blocks of data that it is responsible for, the number
> of blocks of data saved during one run may not be evenly distributed among
> processes in another run, as the mpi_size of the latter run may not evenly
> divide the total number of blocks.
>
> I was hoping that a fill value might save me here, so that reading beyond
> the end of the dataset would just return 0s. However, I believe I saw
> a page noting that this isn't possible for contiguous datasets.
>
> The good news is that since I'm working with 1-dimensional data, it is
> fairly easy to refactor the relevant code.
>
> The error I get emits this message:
>
> [brandon@euca-128-84-11-180 pHDF5]$ mpirun -n 2 perfectNumbers
>
> HDF5-DIAG: Error detected in HDF5 (1.8.12) MPI-process 0:
>   #000: ../../src/H5Dio.c line 179 in H5Dread(): can't read data
>     major: Dataset
>     minor: Read failed
>   #001: ../../src/H5Dio.c line 446 in H5D__read(): src and dest data spaces have different sizes
>     major: Invalid arguments to routine
>     minor: Bad value
> perfectNumbers: perfectNumbers.c:382: restore: Assertion `status != -1' failed.
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 3717 on node
> euca-128-84-11-180 exited on signal 11 (Segmentation fault).
>
> Here is the offending line
> <https://github.com/cornell-comp-internal/CR-demos/blob/3d7ac426b041956b860a0b83c36b88024a64ac1c/demos/pHDF5/perfectNumbers.c#L380>
> in the restore function; you can look at the checkpoint function to see
> how things are written out to disk.
>
> General pointers are appreciated as well. To paraphrase the problem more
> simply: I have a distributed (strided) array that n processes write out to
> disk as a dataset, and when I restart the program, I may want to divvy up
> the data among m processes in similar data structures as before, but now
> m != n. Actually, my problem may be different than just this, since
> I seem to get the same issue even when m == n ... hmm.
>
> Thanks,
> --
> Brandon E. Barker
> http://www.cac.cornell.edu/barker/
>



-- 
Brandon E. Barker
http://www.cac.cornell.edu/barker/
