Hello everyone,

I run simulations on a cluster (using OpenMPI) with a Lustre filesystem and use HDF5 1.8.9 for data output. Each process writes its own file, so I believe there is no need for the parallel HDF5 version. Is this correct?
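To make the setup concrete, each rank creates and writes its own serial-HDF5 file roughly like this. This is a simplified sketch rather than my actual code; the file name and the dataset are just placeholders:

#include <mpi.h>
#include <hdf5.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // One serial-HDF5 file per rank, no parallel HDF5 / MPI-IO involved.
    char fname[64];
    std::snprintf(fname, sizeof(fname), "output_rank%04d.h5", rank);
    hid_t file = H5Fcreate(fname, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    // Dump a small dataset as a stand-in for the real simulation data.
    std::vector<double> data(100, static_cast<double>(rank));
    hsize_t dims[1] = { data.size() };
    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t dset  = H5Dcreate2(file, "/data", H5T_NATIVE_DOUBLE, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data.data());

    H5Dclose(dset);
    H5Sclose(space);
    H5Fclose(file);

    MPI_Finalize();
    return 0;
}

(Built with the MPI compiler wrapper and linked against the serial libhdf5.)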
When a larger number (> 4) of processes want to dump their data at the same time, I get various errors: paths and objects not found, or some other operation failing. I cannot really make out the reason, as the code works fine on my personal workstation and runs for days with writes/reads every 5 minutes without failing.

What I have tried so far is having one process manage all the read/write operations, so that every other process has to check whether anyone else is already dumping its data. I also implemented a boost::interprocess::file_lock to prevent writing to the same file, which the queuing system excludes anyway, so this was more of a paranoid move to be absolutely sure. (A simplified sketch of the locking is in the P.S. below.) All of that reduced the number of fatal errors significantly, but did not get rid of them completely. The biggest problem is that some of the files get corrupted when the program crashes, which is especially inconvenient.

My question is whether there is any obvious mistake I am making, and how I would go about solving this issue. My initial guess is that the Lustre filesystem plays some role in this, since it is the only difference from my personal computer, where everything runs smoothly. As I said, neither the error messages nor the traceback show any consistency.

Bye,
Peter
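P.S. In case it helps, the locking around a dump looks roughly like this. Again a simplified sketch: the ".lock" suffix is just placeholder naming, and write_output() stands in for the actual H5Fopen/H5Dwrite/H5Fclose calls.

#include <boost/interprocess/sync/file_lock.hpp>
#include <boost/interprocess/sync/scoped_lock.hpp>
#include <fstream>
#include <string>

void write_output(const std::string& h5name);  // hypothetical: does the actual HDF5 writes

void locked_dump(const std::string& h5name)
{
    // file_lock needs an existing file, so touch a companion lock file first.
    const std::string lockname = h5name + ".lock";
    { std::ofstream touch(lockname.c_str()); }

    boost::interprocess::file_lock flock(lockname.c_str());
    boost::interprocess::scoped_lock<boost::interprocess::file_lock> guard(flock);

    write_output(h5name);  // only one process at a time gets past the scoped_lock
}   // lock is released when guard goes out of scope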
