I was also getting the same error with MOAB from ANL when we were benchmarking small mesh reads with a large number of processors. When I ran on 16384 processes, the job would terminate with:
Out of memory in file /bgsys/source/srcV1R2M1.17463/comm/lib/dev/mpich2/src/mpi/romio/adio/ad_bg/ad_bg_rdcoll.c, line 1073

A partial discussion of the problem can be found here: http://lists.mpich.org/pipermail/devel/2013-May/000154.html

We did not have time in the project to look into the problem any further.

Scot

On Sep 1, 2015, at 9:34 AM, Wolf Dapp <[email protected]> wrote:

Dear forum members,

this may be too specialized a problem, but maybe somebody still has some insights.

Our code (running on an IBM BlueGene/Q machine) reads in some data using HDF5. This is done collectively, on every core (each one reads the same data at the same time). It is not known a priori which processor owns which part of the data; each has to work that out itself and discard the data it does not own. The data file is ~9.4 MB in a simple test case. The data is a custom datatype: a nested struct with two 32-bit integers and two 64-bit doubles that form a complex number, 192 bits in total.

If I use fewer than 1024 cores, there is no problem. However, for >= 1024 cores, I get a crash with the error

"Out of memory in file /bgsys/source/srcV1R2M3.12428/comm/lib/dev/mpich2/src/mpi/romio/adio/ad_bg/ad_bg_rdcoll.c, line 1073"

We use parallel HDF5 1.8.15; I've also tried 1.8.14. Another library dependency is FFTW 3.3.3, but that should not really matter.

I traced the crash with TotalView to the call to H5Dread(). The second-to-last call in the crash trace is MPIDO_Alltoallv; the last one is PAMI_Context_trylock_advancev. I don't have exact calls or line numbers since the HDF5 library was not compiled with debug symbols. [The file mentioned in the error message is not accessible.]

Is this an HDF5 problem, or a problem with IBM's MPI implementation? Might it be an MPI buffer overflow? Or is there perhaps a problem with data contiguity in the struct?

The problem disappears if I read the file in chunks of less than 192 KiB at a time. A more workable workaround is to replace collective communication with independent communication, in which case the problem also disappears:

    H5Pset_dxpl_mpio(plist_id, H5FD_MPIO_COLLECTIVE); --> H5Pset_dxpl_mpio(plist_id, H5FD_MPIO_INDEPENDENT);

[Sketches of the compound datatype and of this transfer-property switch are appended after the quoted message.]

Since the data file is quite small (usually no larger than a few hundred megabytes), reading it independently is not a huge performance problem at this stage, but for very large simulations it might be. In other, older parts of the code we are (successfully!) reading up to 256 GiB of data in predefined datatypes (double, float) using H5FD_MPIO_COLLECTIVE without any problem, so I suspect the problem is connected with the user-defined datatype in some way.

I attach some condensed code with all calls to the HDF5 library; I'm not sure anyone is in a position to actually reproduce this problem, so the main() routine and the data file are probably unnecessary. However, I'd be happy to also send those if need be.

Thanks in advance for any hints.

Best regards,
Wolf

--
<contMech-9.hdf5.cpp>
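A minimal sketch of how a compound HDF5 datatype matching that description might be registered. The struct layout, member names, and the helper make_record_type() are assumptions based on the post, not taken from the attached contMech-9.hdf5.cpp:

    #include <hdf5.h>
    #include <stdint.h>

    /* Assumed layout: two 32-bit integers plus a double-precision complex
     * number, 24 bytes (192 bits) per record. */
    struct cplx_t   { double re; double im; };
    struct record_t { int32_t ix; int32_t iy; cplx_t value; };

    static hid_t make_record_type(void)
    {
        /* inner compound: the complex number (two 64-bit doubles) */
        hid_t cplx_id = H5Tcreate(H5T_COMPOUND, sizeof(cplx_t));
        H5Tinsert(cplx_id, "re", HOFFSET(cplx_t, re), H5T_NATIVE_DOUBLE);
        H5Tinsert(cplx_id, "im", HOFFSET(cplx_t, im), H5T_NATIVE_DOUBLE);

        /* outer compound: two 32-bit integers plus the nested complex */
        hid_t rec_id = H5Tcreate(H5T_COMPOUND, sizeof(record_t));
        H5Tinsert(rec_id, "ix",    HOFFSET(record_t, ix),    H5T_NATIVE_INT32);
        H5Tinsert(rec_id, "iy",    HOFFSET(record_t, iy),    H5T_NATIVE_INT32);
        H5Tinsert(rec_id, "value", HOFFSET(record_t, value), cplx_id);

        H5Tclose(cplx_id);   /* H5Tinsert() copies the member type */
        return rec_id;       /* caller releases it with H5Tclose() */
    }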
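And a sketch of the read path showing where the COLLECTIVE/INDEPENDENT switch sits. The file and dataset names, the function read_all_records(), and the "every rank reads everything" strategy are illustrative only; record_t and make_record_type() come from the sketch above:

    #include <hdf5.h>
    #include <mpi.h>
    #include <vector>

    std::vector<record_t> read_all_records(const char *fname, const char *dsname,
                                           bool collective)
    {
        /* open the file with the MPI-IO file driver */
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
        hid_t file = H5Fopen(fname, H5F_ACC_RDONLY, fapl);

        hid_t dset   = H5Dopen2(file, dsname, H5P_DEFAULT);
        hid_t fspace = H5Dget_space(dset);
        std::vector<record_t> buf(static_cast<size_t>(H5Sget_simple_extent_npoints(fspace)));

        hid_t memtype = make_record_type();        /* compound type from the sketch above */
        hid_t dxpl    = H5Pcreate(H5P_DATASET_XFER);
        /* Collective transfer goes through ROMIO's two-phase read path
         * (ad_bg_rdcoll.c); independent transfer is the reported workaround. */
        H5Pset_dxpl_mpio(dxpl, collective ? H5FD_MPIO_COLLECTIVE
                                          : H5FD_MPIO_INDEPENDENT);

        /* every rank reads the whole (small) dataset ... */
        H5Dread(dset, memtype, H5S_ALL, H5S_ALL, dxpl, buf.data());

        H5Pclose(dxpl); H5Tclose(memtype); H5Sclose(fspace);
        H5Dclose(dset); H5Fclose(file); H5Pclose(fapl);
        return buf;    /* ... and afterwards discards the records it does not own */
    }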
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5
