Hi dmtcp people,

I have taken over a project that is using dmtcp to checkpoint lengthy runs.

A small percentage of my jobs seem to be failing quietly when writing or reading checkpoint files. I am using DMTCP for jobs which use ~25G of memory; I have not had any trouble with small-memory jobs.

When writing a checkpoint the program produces ".dmtcp" files, e.g.

ckpt_LandscapeSimulator_105f8c25d3b2589c-41000-5c337462.dmtcp

These files have a ".temp" extension while they are being written. About 5% of them stop being written while still carrying the .temp extension, yet dmtcp throws no error (the controller says that the checkpoint was written successfully). About the same proportion do produce the final .dmtcp file but fail on resume: dmtcp reads the file in (I can see it unzipping when I run top on the node), but then the dmtcp processes just sleep forever and the job makes no progress. I thought it might be a problem with gzipping the files, but I get the same problems with zipping turned off.
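
To put a number on the stuck-write symptom, a small script could list checkpoint files that still carry the .temp suffix and have not been modified for a while. This is only a sketch under assumptions: the function name, the "ckpt_" prefix filter, and the 10-minute threshold are illustrative choices, not anything provided by dmtcp.

```python
import os
import time


def find_stale_temp_checkpoints(directory, max_age_seconds=600, now=None):
    """Return paths of ckpt_*.temp files not modified within max_age_seconds.

    Assumption: an in-progress checkpoint keeps being modified, so a .temp
    file whose mtime is older than the threshold is probably stuck.
    """
    now = time.time() if now is None else now
    stale = []
    for name in sorted(os.listdir(directory)):
        # Only consider checkpoint files that still have the .temp suffix.
        if not (name.startswith("ckpt_") and name.endswith(".temp")):
            continue
        path = os.path.join(directory, name)
        if now - os.path.getmtime(path) > max_age_seconds:
            stale.append(path)
    return stale


if __name__ == "__main__":
    # Hypothetical usage: scan the current directory for stuck checkpoints.
    for path in find_stale_temp_checkpoints("."):
        print(path)
```

Running this periodically in the checkpoint directory would show which jobs are affected and when their files stopped growing, which may help correlate the failures with node, time, or file size.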

Any ideas how I can get to the bottom of this? I am using version 2.3.1 as installed by the HPC administrators at the Uni. I asked them and they have no idea!

thanks

Lawrence


