Hi Peter,

The problem does sound strange.
I do not understand why file locking helped reduce errors. I thought you said each process writes to its own file, so locking the file or having one process manage the reads/writes should not matter.

Could you send me the piece of your simulation code that performs the I/O, so that I can look at it and diagnose further? A program that I can run and that replicates the problem (on Lustre) would be great. If that is not possible, then please just describe or copy-paste how you are calling into the HDF5 library for your I/O.
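
In the meantime, since the corrupted files sound like the most painful part, one generic way to limit the damage from a crash is to write each dump to a temporary file and rename it onto the real name only after the write has finished. This is only a sketch under assumptions, not an HDF5 feature and not a fix for the underlying errors: `rank_file_name` and `dump_then_rename` are names I made up for illustration, and `write_fn` stands in for whatever your code does between opening and closing the HDF5 file. It also assumes the temporary file and the target live in the same directory, so the rename stays within one filesystem.

```cpp
#include <filesystem>
#include <fstream>
#include <functional>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper: one output file per MPI rank, e.g. "dump_rank3.h5".
std::string rank_file_name(const std::string& base, int rank) {
    return base + "_rank" + std::to_string(rank) + ".h5";
}

// Hypothetical helper: run write_fn against "<target>.tmp" and rename the
// result onto target only if write_fn reports success. In a real code,
// write_fn would wrap the complete HDF5 open/write/close sequence.
bool dump_then_rename(const fs::path& target,
                      const std::function<bool(const fs::path&)>& write_fn) {
    fs::path tmp = target;
    tmp += ".tmp";

    std::error_code ec;
    if (!write_fn(tmp)) {
        fs::remove(tmp, ec);  // best-effort cleanup of the partial file
        return false;
    }

    // The previous dump (if any) is replaced only after the new one
    // has been written in full.
    fs::rename(tmp, target, ec);
    return !ec;
}
```

With this pattern a crash in the middle of a dump leaves a stale `.tmp` file behind, while the last complete dump keeps its name.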

Thanks,
Mohamad

On 11/18/2012 10:24 AM, Peter Boertz wrote:
Hello everyone,

I run simulations on a cluster (using OpenMPI) with a Lustre filesystem
and I use HDF5 1.8.9 for data output. Each process has its own file, so
I believe there is no need for the parallel HDF5 version, is this correct?

When a larger number (> 4) of processes want to dump their data at the same
time, I get various errors, such as paths and objects not being found or
other operations failing. I can't really make out the reason for it, as the
code works fine on my personal workstation and runs for days with writes
/ reads every 5 minutes without failing.

What I have tried so far is having one process manage all the read/write
operations so that all other processes have to check whether anyone else
is already dumping their data. I also implemented
boost::interprocess::file_lock to prevent writing to the same file, which
is however excluded by the queuing system anyway, so this was more of a
paranoid move to be absolutely sure. All of that helped reduce the number
of fatal errors significantly, but did not get rid of them completely.
The biggest problem is that some of the files get corrupted when the
program crashes, which is especially inconvenient.

My question is whether there is any obvious mistake I am making, and how
I would go about solving this issue. My initial guess is that the Lustre
filesystem plays some role in this, since it is the only difference from
my personal computer, where everything runs smoothly. As I said, neither
the error messages nor the tracebacks show any consistency.

bye, Peter


_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org

