Dear all,

Recently, I've run into a problem with my parallel HDF5 writes. My
program works fine on 8k cores, but when I run it on 16k cores it
crashes while writing a data file through h5dwrite_f(...). All file
writes go through a single function in the code, so the same code path
is used every time, yet for some reason I don't understand it writes the
first file without problems and the second one fails with the following
error message:

Abort(1) on node 0 (rank 0 in comm 1140850688): Fatal error in
MPI_Gather: Invalid buffer pointer, error stack:
MPI_Gather(758): MPI_Gather(sbuf=0xa356f400, scount=16000, MPI_BYTE,
rbuf=(nil), rcount=16000, MPI_BYTE, root=0, comm=0x84000003) failed
MPI_Gather(675): Null buffer pointer
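
For context, every write in the code boils down to the equivalent of the
following (a minimal C sketch of what the Fortran h5dwrite_f call does;
the dataset name, the dataspaces, and the transfer property list are
placeholders for whatever my wrapper function actually sets up):

#include "hdf5.h"

/* Minimal sketch of the shared write path (names are placeholders). */
void write_field(hid_t file_id, const double *data,
                 hid_t memspace, hid_t filespace, hid_t dxpl)
{
    /* Every rank opens the same dataset and writes its own selection;
       dxpl selects collective or independent MPI-IO transfer. */
    hid_t dset = H5Dopen2(file_id, "/field", H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, data);
    H5Dclose(dset);
}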

I've been looking through the HDF5 source code, and it seems to call
MPI_Gather in only one place, the function H5D_obtain_mpio_mode. In that
function, HDF5 tries to allocate the receive buffer using

recv_io_mode_info = (uint8_t *)H5MM_malloc(total_chunks * mpi_size);

which then returns the null pointer seen in rbuf=(nil) instead of a
valid address. So to me it looks like the problem originates in HDF5,
not in MPI.
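
Plugging the numbers from the error stack into that expression makes the
allocation size concrete (a back-of-the-envelope sketch; I'm assuming
total_chunks equals the scount of 16000, and the mpi_size of 16384 is my
guess for the 16k-core run):

#include <stdio.h>

int main(void)
{
    /* Values read off the MPI_Gather error stack above; the exact
       mpi_size is an assumption, not taken from the HDF5 source. */
    size_t total_chunks = 16000;   /* scount: one mode byte per chunk */
    size_t mpi_size     = 16384;   /* ranks in the 16k-core job       */

    /* Receive buffer gathered on rank 0: one byte per chunk per rank. */
    size_t bytes = total_chunks * mpi_size;
    printf("rank 0 must malloc %zu bytes (~%.0f MiB)\n",
           bytes, bytes / (1024.0 * 1024.0));
    return 0;
}

That is roughly 250 MiB on rank 0 alone, which H5MM_malloc may simply be
unable to obtain if the rest of the application already uses most of the
node's memory.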

This problem occurs in both collective and independent I/O mode.
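
For reference, I switch between the two modes through the dataset
transfer property list, roughly like this (a minimal sketch; this
property list is then passed to the write call):

#include "hdf5.h"

/* Sketch: build a transfer property list for either MPI-IO mode. */
hid_t make_dxpl(int collective)
{
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, collective ? H5FD_MPIO_COLLECTIVE
                                      : H5FD_MPIO_INDEPENDENT);
    return dxpl;
}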

Do you have any idea what might be causing this problem, or how to
resolve it? I'm not sure what other information might be useful, but
I'll gladly supply anything you need.

Kind regards,

Stefan Frijters
