Hi Brandon,

The error message indicates that a hyperslab selection goes beyond dataset 
extent.

Please make sure that you are using the correct values for the start, stride, 
count and block parameters in the H5Sselect_hyperslab call (if you use it!).  
It will help if you provide an excerpt from your code that selects hyperslabs 
for each process.

Elena
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Elena Pourmal  The HDF Group  http://hdfgroup.org
1800 So. Oak St., Suite 203, Champaign IL 61820
217.531.6112
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~




On May 28, 2015, at 1:46 PM, Brandon Barker 
<[email protected]> wrote:

I believe I've gotten a bit closer by using chunked datasets 
<https://github.com/cornell-comp-internal/CR-demos/blob/bc507264fe4040d817a2e9603dace0dc06585015/demos/pHDF5/perfectNumbers.c>, 
but I'm now not sure how to get past this:

[brandon@euca-128-84-11-180 pHDF5]$ mpirun -n 2 ./perfectNumbers
m, f, count,: 840, 1680, 84
m, f, count,: 840, 1680, 84
HDF5-DIAG: Error detected in HDF5 (1.8.12) MPI-process 1:
  #000: ../../src/H5Dio.c line 158 in H5Dread(): selection+offset not within 
extent
    major: Dataspace
    minor: Out of range
perfectNumbers: perfectNumbers.c:399: restore: Assertion `status != -1' failed.
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 28420 on node euca-128-84-11-180 
exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------


(m, f, count) are the memory space length, the dataspace length, and the number 
of strided segments to be read in; before setting the extent as follows, I would 
get the error whenever f was not a multiple of m:
dimsf[0] = dimsm[0] * mpi_size;
H5Dset_extent(dset_id, dimsf);
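
One thing I still need to verify is that I grab the file dataspace again after 
changing the extent; as far as I understand, a dataspace handle obtained before 
H5Dset_extent still describes the old extent. Roughly (a sketch, not my actual 
code):

    /* Sketch: after H5Dset_extent, re-open the file dataspace; a handle
     * obtained earlier still describes the old, smaller extent. */
    hid_t filespace = H5Dget_space(dset_id);
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, stride, count, block);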

Now that I am using these 
<https://github.com/cornell-comp-internal/CR-demos/blob/bc507264fe4040d817a2e9603dace0dc06585015/demos/pHDF5/perfectNumbers.c#L351>, 
it doesn't seem to have helped the issue, so there must be something else I 
still need to do.

Incidentally, I was looking at this example 
<https://www.hdfgroup.org/ftp/HDF5/current/src/unpacked/examples/h5_extend.c> 
and am not sure what the point of the following code is, since rank_chunk is 
never used:
    if (H5D_CHUNKED == H5Pget_layout (prop))
       rank_chunk = H5Pget_chunk (prop, rank, chunk_dimsr);

I guess it is just there to demonstrate the H5Pget_chunk call?
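
I would have expected at least the return value to be checked, along these 
lines (a sketch, not code from the example):

    if (H5D_CHUNKED == H5Pget_layout(prop)) {
        rank_chunk = H5Pget_chunk(prop, rank, chunk_dimsr);
        if (rank_chunk < 0)
            fprintf(stderr, "H5Pget_chunk failed\n");
        else
            printf("chunk rank %d, chunk_dimsr[0] = %llu\n",
                   rank_chunk, (unsigned long long) chunk_dimsr[0]);
    }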

On Thu, May 28, 2015 at 10:27 AM, Brandon Barker 
<[email protected]> wrote:
Hi All,

I have fixed (and pushed the fix for) one bug related to an improperly defined 
count in the restore function. I still have issues when m != n:

  #000: ../../src/H5Dio.c line 158 in H5Dread(): selection+offset not within 
extent
    major: Dataspace
    minor: Out of range

I believe this is indicative of me needing to use chunked datasets so that my 
dataset can grow in size dynamically.
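
If I go that route, I believe the dataset has to be created along these lines 
(a sketch with made-up names, not my actual code):

    /* Chunked, extendible 1-D dataset so the extent can grow later. */
    hsize_t dims[1]    = {initial_len};
    hsize_t maxdims[1] = {H5S_UNLIMITED};   /* allow the extent to grow */
    hsize_t chunk[1]   = {CHUNK_LEN};

    hid_t filespace = H5Screate_simple(1, dims, maxdims);
    hid_t dcpl      = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);           /* chunking is required for H5S_UNLIMITED */
    hid_t dset_id   = H5Dcreate(file_id, "dataset", H5T_NATIVE_LONG, filespace,
                                H5P_DEFAULT, dcpl, H5P_DEFAULT);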

On Wed, May 27, 2015 at 5:03 PM, Brandon Barker 
<[email protected]> wrote:
Hi All,

I've been learning pHDF5 by way of developing a toy application that 
checkpoints and restores its state. The restore function was the last to be 
implemented, but after writing it I realized I have an issue: since each 
process is responsible for strided blocks of data, the blocks saved during one 
run may not divide evenly among the processes of another run, because the 
mpi_size of the latter run may not evenly divide the total number of blocks.

I was hoping that a fill value might save me here, so that I would simply read 
0s when trying to read beyond the end of the dataset. However, I believe I saw 
a page noting that this isn't possible for contiguous datasets.
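
What I had in mind was roughly this (a sketch; I am not sure it covers reads 
past the extent at all):

    /* Set a fill value of 0 on the dataset creation property list. */
    long fill = 0;
    H5Pset_fill_value(dcpl, H5T_NATIVE_LONG, &fill);
    /* As far as I can tell the fill value only applies to allocated-but-
     * unwritten elements inside the dataset's extent, not to selections
     * that extend past it. */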

The good news is that since I'm working with 1-dimensional data, it is fairly 
easy to refactor the relevant code.

The error I get emits this message:

[brandon@euca-128-84-11-180 pHDF5]$ mpirun -n 2 perfectNumbers
HDF5-DIAG: Error detected in HDF5 (1.8.12) MPI-process 0:
  #000: ../../src/H5Dio.c line 179 in H5Dread(): can't read data
    major: Dataset
    minor: Read failed
  #001: ../../src/H5Dio.c line 446 in H5D__read(): src and dest data spaces 
have different sizes
    major: Invalid arguments to routine
    minor: Bad value
perfectNumbers: perfectNumbers.c:382: restore: Assertion `status != -1' failed.
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 3717 on node euca-128-84-11-180 
exited on signal 11 (Segmentation fault).

Here is the offending line 
<https://github.com/cornell-comp-internal/CR-demos/blob/3d7ac426b041956b860a0b83c36b88024a64ac1c/demos/pHDF5/perfectNumbers.c#L380> 
in the restore function; you can look at the checkpoint function to see how 
things are written out to disk.

General pointers are appreciated as well. To paraphrase the problem more 
simply: I have a distributed (strided) array that n processes write out to disk 
as a dataset, and when I restart the program I may want to divvy the data up 
among m processes into similar data structures as before, but now m != n. 
Actually, my problem may be different from just this, since I seem to get the 
same issue even when m == n ... hmm.
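
In other words, on restore each of the m readers would have to recompute its 
share from the total length stored in the file, roughly like this (a sketch 
with made-up names):

    /* Rank r of m readers takes every m-th block of the whole dataset. */
    hsize_t nblocks   = total_len / BLOCK_LEN;   /* total_len from H5Sget_simple_extent_dims */
    hsize_t my_blocks = nblocks / m + ((hsize_t) r < nblocks % m ? 1 : 0);

    hsize_t start[1]  = {(hsize_t) r * BLOCK_LEN};
    hsize_t stride[1] = {(hsize_t) m * BLOCK_LEN};
    hsize_t count[1]  = {my_blocks};
    hsize_t block[1]  = {BLOCK_LEN};
    /* The memory dataspace then needs exactly my_blocks * BLOCK_LEN elements,
     * or H5Dread complains that the source and destination dataspaces have
     * different sizes. */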

Thanks,
--
Brandon E. Barker
http://www.cac.cornell.edu/barker/



--
Brandon E. Barker
http://www.cac.cornell.edu/barker/



--
Brandon E. Barker
http://www.cac.cornell.edu/barker/
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5
