Hi everyone,

Happy Memorial Day!

I have a question regarding *dmtcp/2.3* and SLURM.
We use SLURM on our clusters and I want to checkpoint an application that
takes a few weeks to finish.
The application I am using is OpenMP based.

I first ran
*dmtcp_launch --rm --interval 43200 ./assemble.sh*
and after the job was terminated,
*./dmtcp_restart_script.sh*.
I repeated the latter 4 times.
The first 2 times I ran *./dmtcp_restart_script.sh* I got the checkpointing
files.
However, for the next 2 runs of *./dmtcp_restart_script.sh* I didn't get
any checkpointing results, and I kept getting the following error:
*dmtcp_coordinator starting...*
*    Host: c2503.unl.edu*
*    Port: 7779*
*    Checkpoint Interval: 10*
*    Exit on last client: 1*
*Backgrounding...*
*[1430000] ERROR at fileconnection.cpp:708 in refill;
REASON='JASSERT(jalib::Filesystem::FileExists(_path)) failed'*
*     _path = /test/tmp.96merParcels/log*
*Message: File not found.*
*RecoverUnipaths (1430000): Terminating...*
*[43000] ERROR at connectionidentifier.h:96 in assertValid;
REASON='JASSERT(strcmp(sign, HANDSHAKE_SIGNATURE_MSG) == 0) failed'*
*     sign = *
*Message: read invalid message, signature mismatch. (External socket?)*
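
For clarity, the launch/restart sequence described above can be sketched as the script below. This is only an illustration of the logic, not my actual submit script: `pick_command` is a hypothetical helper name, and the final line is a dry run that prints the command instead of executing it. The two commands themselves are the ones quoted above.

```shell
#!/bin/bash
# Sketch only: decide whether to start fresh or resume from a checkpoint.
# pick_command is a hypothetical helper, not part of DMTCP.
pick_command() {
    local dir="${1:-.}"
    if [ -x "$dir/dmtcp_restart_script.sh" ]; then
        # An earlier run left a restart script behind: resume from it.
        echo "./dmtcp_restart_script.sh"
    else
        # First run: launch under DMTCP, checkpointing every 43200 s (12 h).
        echo "dmtcp_launch --rm --interval 43200 ./assemble.sh"
    fi
}

# Dry run: print the command that would be executed in the current directory.
echo "would run: $(pick_command .)"
```

In a real SLURM job the printed command would be executed rather than echoed, once per submission.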

On your github repository you have example submit and restart scripts for
dmtcp and SLURM so I tried that as well.
I observed the same behaviour as above, just without any errors (after the
second run of *./dmtcp_restart_script.sh* I got neither checkpoint
files nor errors).

To me it looks like only the first two runs of *./dmtcp_restart_script.sh*
produce output files, while no checkpoint files are generated afterwards
(runs 3, 4, ...).
I am not sure whether I am missing some additional settings in the
submit/restart scripts to prevent this behaviour.
I am using *dmtcp/2.3*, so should I maybe use *dmtcp/2.4* instead?
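
One way to confirm whether a given restart actually wrote new checkpoint images is to compare file timestamps against a marker created just before the restart. The sketch below assumes the images land in a single directory and follow the `ckpt_*.dmtcp` naming; `count_new_ckpts` and the marker file are hypothetical names used only for illustration.

```shell
#!/bin/bash
# Sketch only: count checkpoint images in DIR that are newer than MARKER.
# count_new_ckpts is a hypothetical helper, not part of DMTCP.
count_new_ckpts() {
    local dir="$1" marker="$2"
    # -newer compares modification times against the marker file.
    find "$dir" -maxdepth 1 -name 'ckpt_*.dmtcp' -newer "$marker" | wc -l | tr -d ' '
}
```

Usage would be to `touch` a marker file right before running *./dmtcp_restart_script.sh* and call `count_new_ckpts . marker` after the job ends; a result of 0 means that restart produced no new images.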

I would highly appreciate your help and possible solutions for this issue.

P.S. I think there is a small typo in the command-line help for
*dmtcp/2.3*.
The help says that the checkpoint interval can be set with the
flags *-i* and *-interval*.
However, *-interval* is not recognized, so *--interval* has to
be used instead.

I am looking forward to hearing from you soon.

Thank you,
Best Regards,
Natasha
_______________________________________________
Dmtcp-forum mailing list
Dmtcp-forum@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
