Hello everyone,

I run simulations on a cluster (using OpenMPI) with a Lustre filesystem
and use HDF5 1.8.9 for data output. Each process writes to its own
file, so I believe there is no need for the parallel HDF5 build. Is
this correct?
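For context, the per-rank output pattern is roughly the following (a
simplified sketch, not my actual code; the file name and structure are
made up):

    #include <mpi.h>
    #include <hdf5.h>
    #include <cstdio>

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);
        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        // Each rank creates and writes only its own file, using the
        // plain (non-parallel) HDF5 library.
        char filename[64];
        std::snprintf(filename, sizeof(filename), "output_rank%04d.h5", rank);
        hid_t file = H5Fcreate(filename, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        if (file >= 0) {
            // ... create groups / datasets and write data here ...
            H5Fclose(file);
        }

        MPI_Finalize();
        return 0;
    }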

When a larger number of processes (> 4) try to dump their data at the
same time, I get various errors: paths or objects not found, or some
other operation failing. I can't really make out the reason for it, as
the same code works fine on my personal workstation and runs for days
with reads/writes every 5 minutes without failing.

What I have tried so far is having one process manage all the
read/write operations, so that every other process has to check whether
anyone else is already dumping its data (roughly as sketched below). I
also added boost::interprocess::file_lock to prevent two processes from
writing to the same file; that situation is already excluded by the
queuing system anyway, so this was more of a paranoid move to be
absolutely sure. All of that reduced the number of fatal errors
significantly, but did not get rid of them completely. The biggest
problem is that some of the files get corrupted when the program
crashes, which is especially inconvenient.
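To make the coordination concrete, the write path is conceptually
similar to the token-passing sketch below (a simplification, not my
actual code: the real code uses a dedicated manager process, and the
lock-file name and helper function here are made up):

    #include <mpi.h>
    #include <hdf5.h>
    #include <boost/interprocess/sync/file_lock.hpp>
    #include <boost/interprocess/sync/scoped_lock.hpp>
    #include <cstdio>
    #include <fstream>

    static void dump_my_file(int rank)
    {
        char h5name[64], lockname[64];
        std::snprintf(h5name, sizeof(h5name), "output_rank%04d.h5", rank);
        std::snprintf(lockname, sizeof(lockname), "output_rank%04d.lock", rank);

        // Paranoid extra guard: hold an advisory lock on a sentinel file
        // while the HDF5 file is open (file_lock needs an existing file).
        std::ofstream(lockname).close();
        boost::interprocess::file_lock flock(lockname);
        boost::interprocess::scoped_lock<boost::interprocess::file_lock> guard(flock);

        hid_t file = H5Fcreate(h5name, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        if (file >= 0) {
            // ... write the current snapshot here ...
            H5Fclose(file);
        }
    }

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);
        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int token = 0;
        if (rank > 0) {
            // Wait until the previous rank has finished dumping.
            MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }

        dump_my_file(rank);

        if (rank < size - 1) {
            // Allow the next rank to start writing.
            MPI_Send(&token, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }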

My question is whether there is any obvious mistake I am making, and
how I should go about solving this issue. My initial guess is that the
Lustre filesystem plays some role in this, since it is the only
difference from my personal workstation, where everything runs
smoothly. As I said, neither the error messages nor the tracebacks show
any consistency.

bye, Peter

