Hi Stefan,

On Mar 24, 2010, at 3:11 AM, Stefan Frijters wrote:

> Dear all,
> 
> Recently, I've run into a problem with my parallel HDF5 writes. My
> program works fine on 8k cores, but when I run it on 16k cores it
> crashes when writing a data file through h5dwrite_f(...).
> All file writes go through a single function in the code, so the same
> code path is always used, but for a reason I don't understand it writes
> one file without problems, while the second one throws the following
> error message:
> 
> Abort(1) on node 0 (rank 0 in comm 1140850688): Fatal error in
> MPI_Gather: Invalid buffer pointer, error stack:
> MPI_Gather(758): MPI_Gather(sbuf=0xa356f400, scount=16000, MPI_BYTE,
> rbuf=(nil), rcount=16000, MPI_BYTE, root=0, comm=0x84000003) failed
> MPI_Gather(675): Null buffer pointer
> 
> I've been looking through the HDF5 source code and it only seems to call
> MPI_Gather in one place, in the function H5D_obtain_mpio_mode. In that
> function HDF tries to allocate a receive buffer using
> 
> recv_io_mode_info = (uint8_t *)H5MM_malloc(total_chunks * mpi_size);
> 
> which then returns the null pointer seen in rbuf=(nil) instead of a
> valid pointer. So it seems to me that it's HDF causing the problem,
> and not MPI.
> 
> This problem occurs in both collective and independent I/O modes.
> 
> Do you have any idea what might be causing this problem, or how to
> resolve it? I'm not sure what other information you might need, but
> I'll do my best to supply it.

        This is a scalability problem we are aware of and are working to
address. In the meantime, can you increase the size of the chunks for your
dataset(s)? That will reduce the total number of chunks, and therefore the
size of the total_chunks * mpi_size receive buffer that rank 0 has to
allocate: with 16k ranks, each chunk costs 16384 bytes in that buffer, so a
large chunk count quickly turns into an allocation too big to satisfy.
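        For illustration, here is a minimal C sketch of what coarser
chunking looks like at dataset creation time (the dataset name, extents,
and chunk sizes below are made up for the example, not taken from your
code):

#include "hdf5.h"

/* All sizes here are illustrative: 1024x1024 chunks over a 16384x16384
 * dataset give 16*16 = 256 chunks, where 64x64 chunks would give 65536,
 * and the rank-0 receive buffer grows with the total chunk count. */
hid_t create_coarsely_chunked_dataset(hid_t file_id)
{
    hsize_t dims[2]       = {16384, 16384};  /* full dataset extent          */
    hsize_t chunk_dims[2] = {1024, 1024};    /* larger chunks, fewer of them */

    hid_t space = H5Screate_simple(2, dims, NULL);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 2, chunk_dims);

    hid_t dset = H5Dcreate2(file_id, "/data", H5T_NATIVE_DOUBLE, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Pclose(dcpl);
    H5Sclose(space);
    return dset;
}

        The same property-list call is available from Fortran as
h5pset_chunk_f, so the change should be a one-liner wherever your datasets
are created.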

        Quincey

