Dear all,

Recently I've run into a problem with my parallel HDF5 writes. My program works fine on 8k cores, but on 16k cores it crashes while writing a data file through h5dwrite_f(...). All file writes go through a single function in the code, so the same code path is used every time; yet, for reasons I don't understand, it writes one file without problems and then fails on the second with the following error message:
Abort(1) on node 0 (rank 0 in comm 1140850688): Fatal error in MPI_Gather: Invalid buffer pointer, error stack:
MPI_Gather(758): MPI_Gather(sbuf=0xa356f400, scount=16000, MPI_BYTE, rbuf=(nil), rcount=16000, MPI_BYTE, root=0, comm=0x84000003) failed
MPI_Gather(675): Null buffer pointer

Looking through the HDF5 source code, MPI_Gather appears to be called in only one place, in the function H5D_obtain_mpio_mode. There, HDF5 tries to allocate a receive buffer with

    recv_io_mode_info = (uint8_t *)H5MM_malloc(total_chunks * mpi_size);

which evidently returns a null pointer (hence rbuf=(nil)) instead of a valid one. So it seems to me that HDF5, not MPI, is causing the problem. This occurs in both collective and independent I/O mode.

Do you have any idea what might be causing this problem, or how to resolve it? I'm not sure what other information you might need, but I'll do my best to supply whatever is required.

Kind regards,
Stefan Frijters

_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org
