Dear forum members,

I have observed an annoying occurrence many times now. I'm running parallel HDF5 (1.8.14) on top of OpenMPI (1.7.2), built with gcc (4.8.1), on openSUSE Linux (13.1). The storage is located on an NFS server.

Running on typically 4 cores, I'm writing relatively large files (at least several hundred MB, sometimes many GB) in parallel with HDF5. Sometimes I have to interrupt the code with CTRL+C during such a write operation (often because of user error). Occasionally, this causes a catastrophic hang, and I get the error message:

    kernel BUG: soft lockup - CPU stuck for 23s!

This invariably leads to a violent system crash after a very short time. I have observed this on at least 5 different machines (same software stack), so I don't believe it is a hardware problem.

Since these lockups only happen during interrupted write operations, I suspect that the HDF5 library is causing them in some way, possibly by not freeing some resources. Of course, it could also be caused by OpenMPI. It might also be related to the fact that I'm not writing to a local physical drive, but to an NFS share. Due to the highly disruptive nature of the problem, I am not keen to reproduce it too often, and I cannot easily try a different (or newer) MPI implementation.

Hence a general question, without appending example code: Has anyone observed this behavior before, and if so, is there a fix? Am I blaming HDF5 unfairly, and is another cause more likely? If this error is unheard of, it's most likely caused by my setup...

Thanks,
Wolf
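
P.S. One workaround that might be worth trying is to never abandon a write halfway: catch the interrupt, only record it, and stop between write steps after closing the file cleanly. The following is a minimal sketch of that idea, not something verified against the setup above. The names `run`, `write_step` and `close_file` are hypothetical placeholders for the real parallel HDF5 write and close calls. It also assumes the signal actually reaches the process; how CTRL+C is delivered to each rank depends on the MPI launcher, and in a real parallel run all ranks would have to agree on stopping before the next collective write.

```python
import signal

# Set by the signal handler; only inspected between write steps.
stop_requested = False


def request_stop(signum, frame):
    """Record the interrupt instead of aborting in the middle of a write."""
    global stop_requested
    stop_requested = True


def run(steps, write_step, close_file):
    """Call write_step(i) for each step, stopping early only between writes.

    write_step and close_file are hypothetical placeholders for the real
    parallel HDF5 write and file-close calls.
    Returns the number of completed write steps.
    """
    previous = signal.signal(signal.SIGINT, request_stop)
    completed = 0
    try:
        for i in range(steps):
            if stop_requested:
                break  # leave the loop before starting another write
            write_step(i)
            completed += 1
    finally:
        close_file()  # always close the file, interrupted or not
        signal.signal(signal.SIGINT, previous)  # restore the old handler
    return completed
```

With this structure, a CTRL+C that arrives during `write_step` lets that step finish; the loop then exits at the next check and the file is closed in the `finally` block.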
