Thanks Elena,
Apologies for using "chunk" in the code below in a different sense (e.g. chunk_counter,
MPI_CHUNK_SIZE) than HDF5 uses it; perhaps I should call them "slabs".
Code from the checkpoint procedure (seems to work):
// dataset and memoryset dimensions (just 1D here)
hsize_t dimsm[] = {chunk_counter * MPI_CHUNK_SIZE};
hsize_t dimsf[] = {dimsm[0] * mpi_size};
hsize_t maxdims[] = {H5S_UNLIMITED};
hsize_t chunkdims[] = {1};

// hyperslab offset and size info
hsize_t start[] = {mpi_rank * MPI_CHUNK_SIZE};
hsize_t count[] = {chunk_counter};
hsize_t block[] = {MPI_CHUNK_SIZE};
hsize_t stride[] = {MPI_CHUNK_SIZE * mpi_size};

dset_plist_create_id = H5Pcreate(H5P_DATASET_CREATE);
status = H5Pset_chunk(dset_plist_create_id, RANK, chunkdims);
dset_id = H5Dcreate(file_id, DATASETNAME, big_int_h5, filespace,
                    H5P_DEFAULT, dset_plist_create_id, H5P_DEFAULT);
assert(dset_id != HDF_FAIL);
H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, stride, count, block);
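(For reference, the write that follows the setup above is roughly the following; this is a
paraphrase rather than a verbatim excerpt from the repository, and plist_xfer_id is just a
placeholder name:)

// Memory dataspace matching the local slab, then a collective write.
// big_int_h5, perf_diffs, etc. follow the names used elsewhere in the program.
memspace = H5Screate_simple(RANK, dimsm, NULL);
hid_t plist_xfer_id = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(plist_xfer_id, H5FD_MPIO_COLLECTIVE);
status = H5Dwrite(dset_id, big_int_h5, memspace, filespace,
                  plist_xfer_id, perf_diffs);
assert(status != HDF_FAIL);
H5Pclose(plist_xfer_id);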
Code from the restore procedure (this is where the problem is):
// dataset and memoryset dimensions (just 1D here)
hsize_t dimsm[1];
hsize_t dimsf[1];

// hyperslab offset and size info
hsize_t start[] = {mpi_rank * MPI_CHUNK_SIZE};
hsize_t count[1];
hsize_t block[] = {MPI_CHUNK_SIZE};
hsize_t stride[] = {MPI_CHUNK_SIZE * mpi_size};

//
// Update dimensions and dataspaces as appropriate
//
// chunk_counter = number of chunks previously used plus enough new chunks
// to be divisible by mpi_size.
chunk_counter = get_restore_chunk_counter(dimsf[0]);
count[0] = chunk_counter;
dimsm[0] = chunk_counter * MPI_CHUNK_SIZE;
dimsf[0] = dimsm[0] * mpi_size;
status = H5Dset_extent(dset_id, dimsf);
assert(status != HDF_FAIL);

//
// Create the memspace for the dataset and allocate data for it
//
memspace = H5Screate_simple(RANK, dimsm, NULL);
perf_diffs = alloc_and_init(perf_diffs, dimsm[0]);

H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, stride, count, block);
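One thing I notice while pasting this: if filespace here was obtained before the
H5Dset_extent call, then as far as I understand it still describes the old extent, so
maybe I need to re-fetch it with H5Dget_space before selecting. A rough sketch of what I
mean (the H5Dget_space refresh is only a guess on my part, and plist_xfer_id is a
placeholder name):

// Re-fetch the file dataspace so it reflects the new extent, then select
// the strided hyperslab and do a collective read.
filespace = H5Dget_space(dset_id);
H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, stride, count, block);
hid_t plist_xfer_id = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(plist_xfer_id, H5FD_MPIO_COLLECTIVE);
status = H5Dread(dset_id, big_int_h5, memspace, filespace,
                 plist_xfer_id, perf_diffs);
assert(status != HDF_FAIL);
H5Pclose(plist_xfer_id);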
Complete example code:
https://github.com/cornell-comp-internal/CR-demos/blob/bc507264fe4040d817a2e9603dace0dc06585015/demos/pHDF5/perfectNumbers.c
Best,
On Thu, May 28, 2015 at 3:43 PM, Elena Pourmal <[email protected]>
wrote:
> Hi Brandon,
>
> The error message indicates that a hyperslab selection goes beyond
> dataset extent.
>
> Please make sure that you are using the correct values for the start,
> stride, count and block parameters in the H5Sselect_hyperslab call (if you
> use it!). It will help if you provide an excerpt from your code that
> selects hyperslabs for each process.
>
> Elena
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Elena Pourmal The HDF Group http://hdfgroup.org
> 1800 So. Oak St., Suite 203, Champaign IL 61820
> 217.531.6112
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
>
>
>
> On May 28, 2015, at 1:46 PM, Brandon Barker <[email protected]>
> wrote:
>
> I believe I've gotten a bit closer by using chunked datasets
> <https://github.com/cornell-comp-internal/CR-demos/blob/bc507264fe4040d817a2e9603dace0dc06585015/demos/pHDF5/perfectNumbers.c>,
> but I'm now not sure how to get past this:
>
> [brandon@euca-128-84-11-180 pHDF5]$ mpirun -n 2
> ./perfectNumbers
>
> m, f, count,: 840, 1680, 84
> m, f, count,: 840, 1680, 84
> HDF5-DIAG: Error detected in HDF5 (1.8.12) MPI-process 1:
> #000: ../../src/H5Dio.c line 158 in H5Dread(): selection+offset not
> within extent
> major: Dataspace
> minor: Out of range
> perfectNumbers: perfectNumbers.c:399: restore: Assertion `status != -1'
> failed.
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 28420 on node
> euca-128-84-11-180 exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
>
> (m,f,count) represent the memory space and dataspace lengths and the
> count of strided segments to be read in; prior to using set extents as
> follows, I would get the error when f was not a multiple of m
> dimsf[0] = dimsm[0] * mpi_size;
> H5Dset_extent(dset_id, dimsf);
>
> Now that I am using these
> <https://github.com/cornell-comp-internal/CR-demos/blob/bc507264fe4040d817a2e9603dace0dc06585015/demos/pHDF5/perfectNumbers.c#L351>,
> I note that it doesn't seem to have helped the issue, so there must be
> something else I still need to do.
>
> Incidentally, I was looking at this example
> <https://www.hdfgroup.org/ftp/HDF5/current/src/unpacked/examples/h5_extend.c>
> and am not sure what the point of the following code is since rank_chunk is
> never used:
> if (H5D_CHUNKED == H5Pget_layout (prop))
> rank_chunk = H5Pget_chunk (prop, rank, chunk_dimsr);
>
> I guess it is just to demonstrate the function call of H5Pget_chunk?
>
> On Thu, May 28, 2015 at 10:27 AM, Brandon Barker <
> [email protected]> wrote:
>
>> Hi All,
>>
>> I have fixed (and pushed the fix for) one bug that related to an
>> improperly defined count in the restore function. I still have issues for m
>> != n:
>>
>> #000: ../../src/H5Dio.c line 158 in H5Dread(): selection+offset not
>> within extent
>> major: Dataspace
>> minor: Out of range
>>
>> I believe this is indicative of me needing to use chunked datasets so
>> that my dataset can grow in size dynamically.
>>
>> On Wed, May 27, 2015 at 5:03 PM, Brandon Barker <
>> [email protected]> wrote:
>>
>>> Hi All,
>>>
>>> I've been learning pHDF5 by way of developing a toy application that
>>> checkpoints and restores its state. The restore function was the last to be
>>> implemented, but I realized after doing so that I have an issue: since each
>>> process has strided blocks of data that it is responsible for, the number
>>> of blocks of data saved during one run may not be evenly distributed among
>>> processes in another run, as the mpi_size of the latter run may not evenly
>>> divide the total number of blocks.
>>>
>>> I was hoping that a fill value might save me here, and just read in 0s
>>> if I try reading beyond the end of the dataset. Although, I believe I did
>>> see a page noting that this isn't possible for contiguous datasets.
>>>
>>> The good news is that since I'm working with 1-dimenional data, it is
>>> fairly easy to refactor relevant code.
>>>
>>> The error I get emits this message:
>>>
>>> [brandon@euca-128-84-11-180 pHDF5]$ mpirun -n 2
>>> perfectNumbers
>>>
>>> HDF5-DIAG: Error detected in HDF5 (1.8.12) MPI-process 0:
>>> #000: ../../src/H5Dio.c line 179 in H5Dread(): can't read data
>>> major: Dataset
>>> minor: Read failed
>>> #001: ../../src/H5Dio.c line 446 in H5D__read(): src and dest data
>>> spaces have different sizes
>>> major: Invalid arguments to routine
>>> minor: Bad value
>>> perfectNumbers: perfectNumbers.c:382: restore: Assertion `status != -1'
>>> failed.
>>>
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 0 with PID 3717 on node
>>> euca-128-84-11-180 exited on signal 11 (Segmentation fault).
>>>
>>> Here is the offending line
>>> <https://github.com/cornell-comp-internal/CR-demos/blob/3d7ac426b041956b860a0b83c36b88024a64ac1c/demos/pHDF5/perfectNumbers.c#L380> in
>>> the restore function; you can observe the checkpoint function to see how
>>> things are written out to disk.
>>>
>>> General pointers are appreciated as well - to paraphrase the problem
>>> more simply: I have a distributed (strided) array I write out to disk as a
>>> dataset among n processes, and when I restart the program, I may want to
>>> divvy up the data among m processes in similar datastructures as before,
>>> but now m != n. Actually, my problem may be different than just this, since
>>> I seem to get the same issue even when m == n ... hmm.
>>>
>>> Thanks,
>>> --
>>> Brandon E. Barker
>>> http://www.cac.cornell.edu/barker/
>>>
>>
>>
>>
>> --
>> Brandon E. Barker
>> http://www.cac.cornell.edu/barker/
>>
>
>
>
> --
> Brandon E. Barker
> http://www.cac.cornell.edu/barker/
>
>
>
--
Brandon E. Barker
http://www.cac.cornell.edu/barker/
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5