Hi Quincey,

Thanks for the quick response. Currently, each core handles its datasets
with a chunk size equal to the size of the local data (the dims parameter
passed to h5pset_chunk_f is the same as the dims parameter passed to
h5dwrite_f), because the local arrays are not that large anyway (on the
order of 20x20x20 reals). So if I understand things correctly, I'm already
using the maximum chunk size.
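
For reference, this is roughly the pattern I use, stripped down to a
sketch (the file name, the simple 1-D decomposition along the last
dimension, and all variable names are just illustrative; my real code is
more involved):

program chunk_sketch
  use mpi
  use hdf5
  implicit none

  integer, parameter :: nl = 20                              ! local block edge
  integer(hsize_t), dimension(3) :: ldims = (/ nl, nl, nl /) ! local dims = chunk dims
  integer(hsize_t), dimension(3) :: gdims, offset
  integer(hid_t) :: fapl, dcpl, xfer, file_id, filespace, memspace, dset_id
  integer :: ierr, mpierr, myrank, nprocs
  real, dimension(nl, nl, nl) :: buf

  call MPI_Init(mpierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, myrank, mpierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, mpierr)
  buf = real(myrank)

  call h5open_f(ierr)

  ! open the file for parallel access
  call h5pcreate_f(H5P_FILE_ACCESS_F, fapl, ierr)
  call h5pset_fapl_mpio_f(fapl, MPI_COMM_WORLD, MPI_INFO_NULL, ierr)
  call h5fcreate_f('out.h5', H5F_ACC_TRUNC_F, file_id, ierr, access_prp=fapl)

  ! global dataset, decomposed along the last dimension for this sketch
  gdims = ldims
  gdims(3) = ldims(3) * nprocs
  call h5screate_simple_f(3, gdims, filespace, ierr)

  ! chunk dimensions equal to the local block, as described above
  call h5pcreate_f(H5P_DATASET_CREATE_F, dcpl, ierr)
  call h5pset_chunk_f(dcpl, 3, ldims, ierr)
  call h5dcreate_f(file_id, 'data', H5T_NATIVE_REAL, filespace, dset_id, ierr, dcpl)

  ! each rank selects and writes its own 20x20x20 block
  offset = (/ 0_hsize_t, 0_hsize_t, ldims(3) * myrank /)
  call h5sselect_hyperslab_f(filespace, H5S_SELECT_SET_F, offset, ldims, ierr)
  call h5screate_simple_f(3, ldims, memspace, ierr)

  ! collective transfer; the crash also happens with independent IO
  call h5pcreate_f(H5P_DATASET_XFER_F, xfer, ierr)
  call h5pset_dxpl_mpio_f(xfer, H5FD_MPIO_COLLECTIVE_F, ierr)
  call h5dwrite_f(dset_id, H5T_NATIVE_REAL, buf, ldims, ierr, &
                  mem_space_id=memspace, file_space_id=filespace, xfer_prp=xfer)

  call h5pclose_f(xfer, ierr)
  call h5pclose_f(dcpl, ierr)
  call h5pclose_f(fapl, ierr)
  call h5sclose_f(memspace, ierr)
  call h5sclose_f(filespace, ierr)
  call h5dclose_f(dset_id, ierr)
  call h5fclose_f(file_id, ierr)
  call h5close_f(ierr)
  call MPI_Finalize(mpierr)
end program chunk_sketch

With ldims doubling as the chunk dimensions there is exactly one chunk per
rank, so I don't see how I could make the chunks any larger without
changing the domain decomposition.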

Do you have any idea why it doesn't crash on the first write, though? The
first file holds a different array, but one of the same size and datatype
as the second. As far as I can tell, I'm closing all the handles I use at
the end of my write function.
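
For what it's worth, if I read the error message correctly, the root rank
needs a receive buffer of rcount * mpi_size = 16000 * 16384 bytes for that
MPI_Gather, i.e. about 250 MB, so my (purely speculative) guess is that
after the first write there simply isn't enough free memory left on rank 0
for that allocation to succeed.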

Kind regards,

Stefan Frijters

> Hi Stefan,
> 
> On Mar 24, 2010, at 3:11 AM, Stefan Frijters wrote:
> 
>> Dear all,
>> 
>> Recently, I've run into a problem with my parallel HDF5 writes. My
>> program works fine on 8k cores, but when I run it on 16k cores it
>> crashes when writing a data file through h5dwrite_f(...).
>> All file writes go through a single function in the code, so the same
>> code path is used every time. Still, for some reason I don't understand,
>> the first file is written without problems, but the second one throws
>> the following error message:
>> 
>> Abort(1) on node 0 (rank 0 in comm 1140850688): Fatal error in
>> MPI_Gather: Invalid buffer pointer, error stack:
>> MPI_Gather(758): MPI_Gather(sbuf=0xa356f400, scount=16000, MPI_BYTE,
>> rbuf=(nil), rcount=16000, MPI_BYTE, root=0, comm=0x84000003) failed
>> MPI_Gather(675): Null buffer pointer
>> 
>> I've been looking through the HDF5 source code, and it only seems to call
>> MPI_Gather in one place, in the function H5D_obtain_mpio_mode. In that
>> function HDF5 tries to allocate a receive buffer using
>> 
>> recv_io_mode_info = (uint8_t *)H5MM_malloc(total_chunks * mpi_size);
>> 
>> which then returns the null pointer seen as rbuf=(nil) instead of a
>> valid pointer. Thus, to me it seems that HDF5 is causing the problem and
>> not MPI.
>> 
>> This problem occurs in both collective and independent IO mode.
>> 
>> Do you have any idea what might be causing this problem, or how to
>> resolve it? I'm not sure what other information you might need, but
>> I'll do my best to supply whatever would help.
> 
> This is a scalability problem we are aware of and are working to address,
> but in the meantime, could you increase the chunk size for your
> dataset(s)? (That will reduce the number of chunks and thus the size of
> the buffer being allocated.)
> 
>       Quincey
> 


