Hi dmtcp people,
I have taken over a project that is using dmtcp to checkpoint lengthy runs.
I am having a problem in that a small percentage of jobs seem to be
quietly failing when writing/reading checkpoint files. I am using DMTCP
for jobs which use ~25G of memory; I have not had any trouble with
small-memory jobs.
When writing a checkpoint the program produces ".dmtcp" files, e.g.
ckpt_LandscapeSimulator_105f8c25d3b2589c-41000-5c337462.dmtcp
These files have a ".temp" extension while they are being written. I
notice that about 5% of these files stop being written while still
having the .temp extension, though no error is thrown by dmtcp (the
coordinator says that the checkpoint was written successfully). I also
find that about the same proportion seem to successfully write the
.dmtcp file, but when resumed the dmtcp code reads them in (I can see it
unzipping when I run top on the node) but then the dmtcp processes just
sleep forever and no progress is made on the job. I thought perhaps it
was a problem with gzipping the files, but when I turned compression off
I got the same problems.
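
One way to quantify the first symptom is to list the leftover ".temp"
files that have stopped growing. The following is only a minimal sketch,
not part of DMTCP: the function name, the 10-minute idle threshold, and
the assumption that in-progress checkpoints are named "ckpt_*" with a
trailing ".temp" are all illustrative guesses based on the file names
described above.

```python
import os
import time


def find_stalled_temp_files(ckpt_dir, min_idle_seconds=600, now=None):
    """Return checkpoint temp files that have not been modified recently.

    Assumption: an in-progress checkpoint is named 'ckpt_*' and ends in
    '.temp'; a file untouched for min_idle_seconds is treated as stalled.
    """
    now = time.time() if now is None else now
    stalled = []
    for name in sorted(os.listdir(ckpt_dir)):
        # Skip finished checkpoints and unrelated files.
        if not (name.startswith("ckpt_") and name.endswith(".temp")):
            continue
        path = os.path.join(ckpt_dir, name)
        # Compare the last-modification time against the idle threshold.
        if now - os.path.getmtime(path) >= min_idle_seconds:
            stalled.append(path)
    return stalled
```

Calling find_stalled_temp_files(".") from the checkpoint directory
returns the paths of any temp files idle for at least ten minutes, which
could then be matched against the job logs for the affected nodes.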
Any ideas how I can get to the bottom of this? I am using version 2.3.1
as installed by the HPC administrators at the Uni. I asked them and they
have no idea!
thanks
Lawrence
_______________________________________________
Dmtcp-forum mailing list
Dmtcp-forum@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum