Hi Quincey,

Thanks for the quick response. Currently, each core handles its datasets with a chunk size equal to the size of the local data (the dims parameter in h5pset_chunk_f is equal to the dims parameter in h5dwrite_f), because the local arrays are not that large anyway (on the order of 20x20x20 reals), so if I understand things correctly I'm already using the maximum chunk size.
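In case it helps, this is roughly what the write path looks like. It is only a minimal sketch of the setup described above (the file name, dataset name, variable names and the 1D decomposition along x are placeholders, not my actual code), assuming the HDF5 1.8 Fortran interface:

    program chunked_write_sketch
      ! Sketch only: each rank writes one 20x20x20 block of reals, and the
      ! chunk dims passed to h5pset_chunk_f equal the dims passed to h5dwrite_f.
      use mpi
      use hdf5
      implicit none

      integer, parameter :: nx = 20, ny = 20, nz = 20
      integer(hsize_t) :: ldims(3), gdims(3), offset(3)
      real :: local_data(nx, ny, nz)
      integer(hid_t) :: fapl, dcpl, xfer, file_id, filespace, memspace, dset_id
      integer :: ierr, myrank, nprocs

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
      call h5open_f(ierr)

      local_data = real(myrank)
      ldims  = (/ nx, ny, nz /)              ! local block = chunk size
      gdims  = (/ nx * nprocs, ny, nz /)     ! global extent (1D decomposition along x)
      offset = (/ nx * myrank, 0, 0 /)       ! this rank's block in the file

      ! Open one shared file over MPI-IO.
      call h5pcreate_f(H5P_FILE_ACCESS_F, fapl, ierr)
      call h5pset_fapl_mpio_f(fapl, MPI_COMM_WORLD, MPI_INFO_NULL, ierr)
      call h5fcreate_f('out.h5', H5F_ACC_TRUNC_F, file_id, ierr, access_prp=fapl)

      ! Chunked dataset: chunk dims = local dims, i.e. one chunk per rank.
      call h5pcreate_f(H5P_DATASET_CREATE_F, dcpl, ierr)
      call h5pset_chunk_f(dcpl, 3, ldims, ierr)
      call h5screate_simple_f(3, gdims, filespace, ierr)
      call h5dcreate_f(file_id, 'field', H5T_NATIVE_REAL, filespace, dset_id, ierr, dcpl)

      ! Each rank selects its own block of the file and writes it.
      call h5sselect_hyperslab_f(filespace, H5S_SELECT_SET_F, offset, ldims, ierr)
      call h5screate_simple_f(3, ldims, memspace, ierr)
      call h5pcreate_f(H5P_DATASET_XFER_F, xfer, ierr)
      call h5pset_dxpl_mpio_f(xfer, H5FD_MPIO_COLLECTIVE_F, ierr)  ! same failure in independent mode
      call h5dwrite_f(dset_id, H5T_NATIVE_REAL, local_data, ldims, ierr, &
                      file_space_id=filespace, mem_space_id=memspace, xfer_prp=xfer)

      ! Close all handles before leaving the write function.
      call h5dclose_f(dset_id, ierr)
      call h5sclose_f(memspace, ierr)
      call h5sclose_f(filespace, ierr)
      call h5pclose_f(dcpl, ierr)
      call h5pclose_f(xfer, ierr)
      call h5pclose_f(fapl, ierr)
      call h5fclose_f(file_id, ierr)
      call h5close_f(ierr)
      call MPI_Finalize(ierr)
    end program chunked_write_sketch

The point is just that the chunk dims given to h5pset_chunk_f and the dims given to h5dwrite_f are the same array, so there is exactly one chunk per rank.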
Do you have an idea why it doesn't crash the first time I try to do it, though? It's a different array, but of the same size and datatype as the second. As far as I can see, I'm closing all used handles at the end of my function, at least.

Kind regards,

Stefan Frijters

> Hi Stefan,
>
> On Mar 24, 2010, at 3:11 AM, Stefan Frijters wrote:
>
>> Dear all,
>>
>> Recently, I've run into a problem with my parallel HDF5 writes. My
>> program works fine on 8k cores, but when I run it on 16k cores it
>> crashes when writing a data file through h5dwrite_f(...).
>> File writes go through one function in the code, so it always uses the
>> same code, but for some reason I don't understand it writes one file
>> without problems, but the second one throws the following error message:
>>
>> Abort(1) on node 0 (rank 0 in comm 1140850688): Fatal error in
>> MPI_Gather: Invalid buffer pointer, error stack:
>> MPI_Gather(758): MPI_Gather(sbuf=0xa356f400, scount=16000, MPI_BYTE,
>> rbuf=(nil), rcount=16000, MPI_BYTE, root=0, comm=0x84000003) failed
>> MPI_Gather(675): Null buffer pointer
>>
>> I've been looking through the HDF5 source code and it only seems to call
>> MPI_Gather in one place, in the function H5D_obtain_mpio_mode. In that
>> function HDF tries to allocate a receive buffer using
>>
>> recv_io_mode_info = (uint8_t *)H5MM_malloc(total_chunks * mpi_size);
>>
>> which then returns the null pointer seen in rbuf=(nil) instead of a
>> valid pointer. Thus, to me it seems it's HDF causing the problem and not
>> MPI.
>>
>> This problem occurs in both collective and independent IO mode.
>>
>> Do you have any idea what might be causing this problem, or how to
>> resolve it? I'm not sure what kind of other information you might need,
>> but I'll do my best to supply it, if you need any.
>
> This is a scalability problem we are aware of and are working to address,
> but in the meanwhile, can you increase the size of your chunks for your
> dataset(s)? (That will reduce the number of chunks and the size of the
> buffer being allocated.)
>
> Quincey
>
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org
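P.S. If I read H5D_obtain_mpio_mode correctly, each rank sends one byte per chunk in that MPI_Gather (which matches scount=16000 above), so with essentially one chunk per rank the receive buffer on rank 0 grows as total_chunks * mpi_size, i.e. roughly mpi_size squared bytes: about 16000 * 16000 bytes (~244 MiB) at 16k cores versus ~61 MiB at 8k cores. That would explain why the allocation only starts failing at the larger core count.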
