Thank you, Berk, Justin, and Matthew, for your assistance.
I checked with my sysadmin, who said:
The /global/scratch FS is Lustre. It is fully POSIX and the fsync etc
are fully and well implemented. However when the 'power off' command is
issued there is no way OS can finish I/O in a controlled way.
Note that the power-off command was given when they realized that they had
lost all cooling in the data room and had just a few minutes to react,
forcing them to shut down all compute nodes.
Justin's suggestion to use -cpnum is good, although I think it will be easier
to simply have a script that runs gmxcheck once every 12 hours and backs up
the .cpt file if it is OK.
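A minimal sketch of such a script, in Python, might look like the following. It assumes that "gmxcheck -f <file>" exits with a non-zero status when it cannot read the checkpoint, which should be verified on your installation before relying on it. The function name, the check_cmd parameter, and the backup directory are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def backup_if_valid(cpt, backup_dir, check_cmd=("gmxcheck", "-f")):
    """Copy cpt into backup_dir only if the check command exits with status 0.

    Returns the path of the backup, or None if the check failed.
    ASSUMPTION: the check command signals a bad file via its exit status.
    """
    cpt = Path(cpt)
    result = subprocess.run([*check_cmd, str(cpt)], capture_output=True, text=True)
    if result.returncode != 0:
        # Checkpoint looks unreadable: keep the previous backup untouched.
        return None

    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / cpt.name
    tmp = target.with_name(target.name + ".tmp")

    # Copy to a temporary name first, then rename, so that an interrupted
    # copy never overwrites the last good backup.
    shutil.copy2(cpt, tmp)
    tmp.replace(target)
    return target
```

Calling backup_if_valid("state.cpt", "cpt_backup") from a cron job every 12 hours would give the behaviour described above. Ideally the backup directory should be on a different file system from the one the run writes to.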
I don't know enough about operating systems to say whether there is any way
for GROMACS to avoid this in the future, but if there is, it would be useful.
Thank you again,
Chris.
-- original message --
Gromacs calls fsync for every checkpoint file written:
fsync() transfers ("flushes") all modified in-core data of (i.e., modified
buffer cache pages for) the file referred to by the file descriptor fd to
the disk device (or other permanent storage device) so that all changed
information can be retrieved even after the system crashed or was rebooted.
This includes writing through or flushing a disk cache if present. The call
blocks until the device reports that the transfer has completed. It also
flushes metadata information associated with the file (see stat(2)).
If fsync fails, mdrun exits with a fatal error.
We have experience with unreliable AFS file systems, where mdrun could wait
for hours in fsync and then fail, for which we added an environment variable.
So either fsync is not supported on your system (highly unlikely), or your
file system returns 0, indicating the file was synced, when it actually
didn't fully sync.
Note that we first write a new checkpoint file with a number, fsync that,
then move the current one to _prev (thereby losing the old _prev), and then
move the numbered one to the current.
So you should never end up with only corrupted files, unless fsync doesn't do
what it's supposed to do.
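The sequence described above can be illustrated with a short Python sketch. This is not the actual mdrun code, only the same write, fsync, rotate pattern; the function name and the file naming scheme (state_stepN.cpt, state_prev.cpt) are assumptions made for the example.

```python
import os


def write_checkpoint(data: bytes, current: str = "state.cpt", step: int = 0) -> None:
    """Write data as the new current checkpoint, keeping one previous copy."""
    root, ext = os.path.splitext(current)
    numbered = f"{root}_step{step}{ext}"  # hypothetical numbered file name
    prev = f"{root}_prev{ext}"

    # 1. Write the new checkpoint under a numbered name and force it to disk.
    #    os.fsync raises OSError if the underlying fsync call fails.
    with open(numbered, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())

    # 2. Move the current checkpoint to _prev, discarding the old _prev.
    if os.path.exists(current):
        os.replace(current, prev)

    # 3. Move the numbered file into place as the current checkpoint.
    os.replace(numbered, current)
```

At every point in this sequence at least one fully written file exists on disk, provided fsync really returns only after the data is stored. If the file system reports success early, a power loss can leave the newest file incomplete even though the rotation has already happened.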
Cheers,
Berk