We have changed the Lustre mount options (added the flock option) and that has alleviated the previous errors we were seeing. Many thanks to Rob Latham for confirming that this was the appropriate course of action. Now, however, I am seeing some new errors that are still very cryptic and less than informative. The errors, attached here, mention memory corruption and memory-mapping issues, which is rather ominous.
As Rob pointed out, however, we are using an old version of MVAPICH, mvapich-1.1.0-qlc. (We need this to support legacy codes.) I also recall some discussion in the HDF5 documentation about the prevalence of bugs in MPI implementations. Could this be an issue with our MPI implementation?

Here are some more details about when the bug shows up. The error only occurs when performing collective h5dwrite_f operations while some of the processes have selected null hyperslabs (h5sselect_none_f), and even then only some of the time (it appears to depend on the topology of the MPI ranks with null selections). If the data transfer property is set to individual, and only the MPI ranks with data to write make calls to h5dwrite_f, then the data is written successfully. Individual I/O is prohibitively slow, however: more than two orders of magnitude slower than collective I/O, which would cause this portion of the I/O to take as long as, or longer than, the computation portion of the simulation. I have appended a stripped-down sketch of the I/O pattern below my signature.

Lastly, I want to thank everyone on the list for their patience with me; I have been asking for a lot of help recently, and your responses have been incredibly helpful.

Thank you all so much,

Izaak Beekman
===================================
(301)244-9367
Princeton University Doctoral Candidate
Mechanical and Aerospace Engineering
[email protected]

UMD-CP Visiting Graduate Student
Aerospace Engineering
[email protected]

[email protected]
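P.S. In case a reproducer helps: below is a stripped-down sketch of the I/O pattern, not our production code. The file name, dataset name, sizes, and the even/odd split of writers are made up for illustration. Even ranks write a 10-element slab; odd ranks call h5sselect_none_f on both dataspaces but still take part in the collective h5dwrite_f.

program null_selection_collective_write
  use mpi
  use hdf5
  implicit none

  integer :: mpierr, hdferr, nprocs, myrank
  integer(hid_t) :: fapl, dxpl, file_id, dset_id, filespace, memspace
  integer(hsize_t) :: dims(1), count(1), offset(1)
  double precision, allocatable :: buf(:)

  call MPI_Init(mpierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, mpierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, myrank, mpierr)
  call h5open_f(hdferr)

  ! Open the file for parallel access through the MPI-IO driver
  call h5pcreate_f(H5P_FILE_ACCESS_F, fapl, hdferr)
  call h5pset_fapl_mpio_f(fapl, MPI_COMM_WORLD, MPI_INFO_NULL, hdferr)
  call h5fcreate_f('repro.h5', H5F_ACC_TRUNC_F, file_id, hdferr, &
                   access_prp=fapl)

  ! One 10-element slab per rank in a 1-D dataset
  dims(1) = int(10 * nprocs, hsize_t)
  call h5screate_simple_f(1, dims, filespace, hdferr)
  call h5dcreate_f(file_id, 'data', H5T_NATIVE_DOUBLE, filespace, &
                   dset_id, hdferr)

  ! Collective data transfer property list
  call h5pcreate_f(H5P_DATASET_XFER_F, dxpl, hdferr)
  call h5pset_dxpl_mpio_f(dxpl, H5FD_MPIO_COLLECTIVE_F, hdferr)

  count(1)  = 10
  offset(1) = int(10 * myrank, hsize_t)
  allocate(buf(10))
  buf = dble(myrank)

  call h5screate_simple_f(1, count, memspace, hdferr)
  if (mod(myrank, 2) == 0) then
     ! Even ranks select their slab of the file dataspace
     call h5sselect_hyperslab_f(filespace, H5S_SELECT_SET_F, offset, &
                                count, hdferr)
  else
     ! Odd ranks select nothing in both file and memory dataspaces
     call h5sselect_none_f(filespace, hdferr)
     call h5sselect_none_f(memspace, hdferr)
  end if

  ! Every rank participates in the collective write,
  ! including those with empty selections
  call h5dwrite_f(dset_id, H5T_NATIVE_DOUBLE, buf, count, hdferr, &
                  mem_space_id=memspace, file_space_id=filespace, &
                  xfer_prp=dxpl)

  call h5sclose_f(memspace, hdferr)
  call h5sclose_f(filespace, hdferr)
  call h5dclose_f(dset_id, hdferr)
  call h5pclose_f(dxpl, hdferr)
  call h5pclose_f(fapl, hdferr)
  call h5fclose_f(file_id, hdferr)
  call h5close_f(hdferr)
  call MPI_Finalize(mpierr)
end program null_selection_collective_write

The working-but-slow independent path described above corresponds to replacing H5FD_MPIO_COLLECTIVE_F with H5FD_MPIO_INDEPENDENT_F and guarding the h5dwrite_f call so that only the ranks with data make it.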
[Attachment: error (binary data)]
