I believe I've gotten a bit closer by using chunked datasets
<https://github.com/cornell-comp-internal/CR-demos/blob/bc507264fe4040d817a2e9603dace0dc06585015/demos/pHDF5/perfectNumbers.c>,
but I'm now not sure how to get past this:

[brandon@euca-128-84-11-180 pHDF5]$ mpirun -n 2 ./perfectNumbers

m, f, count,: 840, 1680, 84
m, f, count,: 840, 1680, 84
HDF5-DIAG: Error detected in HDF5 (1.8.12) MPI-process 1:
  #000: ../../src/H5Dio.c line 158 in H5Dread(): selection+offset not
within extent
    major: Dataspace
    minor: Out of range
perfectNumbers: perfectNumbers.c:399: restore: Assertion `status != -1'
failed.
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 28420 on node
euca-128-84-11-180 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------


(m, f, count) are the memory space length, the file dataspace length, and
the count of strided segments to be read in. Before extending the dataset
as follows, I would get the error whenever f was not a multiple of m:

    dimsf[0] = dimsm[0] * mpi_size;
    H5Dset_extent(dset_id, dimsf);
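
To make this concrete, here is a rough sketch of the per-rank clamping I
think is needed when the file extent is not a multiple of the per-rank
length (filespace, mpi_rank, and mpi_size follow the names in my code, but
the logic below is only an illustration, not what perfectNumbers.c does):

    /* Illustrative sketch: split a 1-D file extent of dimsf[0] elements
     * across mpi_size ranks, clamping so offset + count never exceeds
     * the extent. */
    hsize_t total  = dimsf[0];
    hsize_t per    = (total + mpi_size - 1) / mpi_size;  /* ceiling split */
    hsize_t offset = per * (hsize_t) mpi_rank;
    hsize_t count  = (offset >= total) ? 0
                   : (offset + per > total) ? total - offset : per;
    herr_t  status;

    if (count > 0)
        status = H5Sselect_hyperslab (filespace, H5S_SELECT_SET,
                                      &offset, NULL, &count, NULL);
    else
        status = H5Sselect_none (filespace);  /* nothing left for this rank */

    /* The memory dataspace must then be created with length count rather
     * than a fixed dimsm[0], so that the two selections have equal size. */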

Now that I am using these calls
<https://github.com/cornell-comp-internal/CR-demos/blob/bc507264fe4040d817a2e9603dace0dc06585015/demos/pHDF5/perfectNumbers.c#L351>,
I see that they don't seem to have helped, so there must be something else
I still need to do.

Incidentally, I was looking at this example
<https://www.hdfgroup.org/ftp/HDF5/current/src/unpacked/examples/h5_extend.c>
and am not sure what the point of the following code is, since rank_chunk
is never used:
    if (H5D_CHUNKED == H5Pget_layout (prop))
       rank_chunk = H5Pget_chunk (prop, rank, chunk_dimsr);

I guess it is just to demonstrate the function call of H5Pget_chunk?
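
If the return value were actually meant to be used, I would have expected
something like the following (my guess at the intent, not code from
h5_extend.c):

    if (H5D_CHUNKED == H5Pget_layout (prop)) {
        rank_chunk = H5Pget_chunk (prop, rank, chunk_dimsr);
        if (rank_chunk != rank)
            fprintf (stderr, "unexpected chunk rank\n");  /* sanity check */
    }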

On Thu, May 28, 2015 at 10:27 AM, Brandon Barker <[email protected]> wrote:

> Hi All,
>
> I have fixed (and pushed the fix for) one bug related to an improperly
> defined count in the restore function. I still have issues when m != n:
>
>   #000: ../../src/H5Dio.c line 158 in H5Dread(): selection+offset not
> within extent
>     major: Dataspace
>     minor: Out of range
>
> I believe this indicates that I need to use chunked datasets so that my
> dataset can grow in size dynamically.
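>
> Something like the following is what I have in mind (a sketch with
> illustrative names, not the code I have pushed):
>
>     hsize_t dims[1]    = {initial_len};
>     hsize_t maxdims[1] = {H5S_UNLIMITED};  /* allow the dataset to grow */
>     hsize_t chunk[1]   = {CHUNK_LEN};      /* some fixed chunk length */
>
>     hid_t space = H5Screate_simple (1, dims, maxdims);
>     hid_t dcpl  = H5Pcreate (H5P_DATASET_CREATE);
>     H5Pset_chunk (dcpl, 1, chunk);  /* chunking is what permits H5Dset_extent */
>     hid_t dset  = H5Dcreate2 (file_id, "state", H5T_NATIVE_LONG, space,
>                               H5P_DEFAULT, dcpl, H5P_DEFAULT);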
>
> On Wed, May 27, 2015 at 5:03 PM, Brandon Barker <[email protected]> wrote:
>
>> Hi All,
>>
>> I've been learning pHDF5 by developing a toy application that checkpoints
>> and restores its state. The restore function was the last to be
>> implemented, and after writing it I realized I have an issue: since each
>> process is responsible for strided blocks of data, the blocks saved
>> during one run may not divide evenly among the processes of another run,
>> because the mpi_size of the latter run may not evenly divide the total
>> number of blocks.
>>
>> I was hoping that a fill value might save me here, so that I would just
>> read in 0s when reading beyond the end of the dataset, although I believe
>> I saw a page noting that this isn't possible for contiguous datasets.
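>>
>> For reference, the fill value idea would look roughly like this (sketch
>> only; dcpl is an illustrative dataset creation property list):
>>
>>     long fill = 0;
>>     H5Pset_fill_value (dcpl, H5T_NATIVE_LONG, &fill);  /* unwritten data reads as 0 */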
>>
>> The good news is that since I'm working with 1-dimensional data, it is
>> fairly easy to refactor the relevant code.
>>
>> The error message I get is:
>>
>> [brandon@euca-128-84-11-180 pHDF5]$ mpirun -n 2 perfectNumbers
>>
>> HDF5-DIAG: Error detected in HDF5 (1.8.12) MPI-process 0:
>>   #000: ../../src/H5Dio.c line 179 in H5Dread(): can't read data
>>     major: Dataset
>>     minor: Read failed
>>   #001: ../../src/H5Dio.c line 446 in H5D__read(): src and dest data
>> spaces have different sizes
>>     major: Invalid arguments to routine
>>     minor: Bad value
>> perfectNumbers: perfectNumbers.c:382: restore: Assertion `status != -1'
>> failed.
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 0 with PID 3717 on node
>> euca-128-84-11-180 exited on signal 11 (Segmentation fault).
>>
>> Here is the offending line
>> <https://github.com/cornell-comp-internal/CR-demos/blob/3d7ac426b041956b860a0b83c36b88024a64ac1c/demos/pHDF5/perfectNumbers.c#L380>
>> in the restore function; you can look at the checkpoint function to see
>> how things are written out to disk.
>>
>> General pointers are appreciated as well. To paraphrase the problem more
>> simply: I have a distributed (strided) array that I write out to disk as
>> a dataset among n processes, and when I restart the program, I may want
>> to divvy up the data among m processes in similar data structures as
>> before, but now m != n (for example, data written by n = 2 processes may
>> not divide evenly among m = 3). Actually, my problem may be different
>> from just this, since I seem to get the same issue even when m == n ...
>> hmm.
>>
>> Thanks,
>> --
>> Brandon E. Barker
>> http://www.cac.cornell.edu/barker/
>>
>
>
>
> --
> Brandon E. Barker
> http://www.cac.cornell.edu/barker/
>



-- 
Brandon E. Barker
http://www.cac.cornell.edu/barker/