Hi all,

I'm using a cluster that uses Torque as the batch system.  About half of the
time, checkpointing fails while copying the temporary output buffer/file,
with the following error:

[27763] ERROR at connection.cpp:1214 in CopyFile;
REASON='JASSERT(_real_system(command.c_str()) != -1) failed'

The generic system command is:

    cp -f /var/spool/torque/spool/jobid.myserver.OU /checkpoint_dir/ckpt_myprog_52b886013bb1c112-27763-51060104_files/jobid.myserver.OU_99001

I'm using dmtcp_checkpoint (v1.2.6) with the --checkpoint-open-files option.
Is anyone familiar enough with Torque to suggest why the file might not
exist at the time of checkpointing, or what else might cause the CopyFile
failure?

Thanks,

Kit

_______________________________________________
Dmtcp-forum mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
