Gave it a shot and it progresses past the initialization steps, but breaks at
the first checkpoint (before, it would quit after less than 20 seconds). I
currently just have a Python script changing the host names in the
restart script DMTCP uses (dmtcp_restart_script.sh), and that appears to be
enough to get successful restarts. I change the nodes/hosts at about line
240 of the restart script, on the line that starts with `worker_ckpts='`.
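
For anyone trying the same workaround, here is a minimal sketch of that kind of host-name rewrite. It is not the actual script from this thread. It assumes the restart script holds a single-quoted `worker_ckpts='...'` block in which each host name appears as a standalone token; the function name `remap_hosts` and the `host_map` argument are hypothetical, chosen only for illustration.

```python
import re


def remap_hosts(script_text, host_map):
    """Return script_text with host names swapped inside worker_ckpts='...'.

    host_map maps old host names to new ones, e.g. {"node1": "node7"}.
    Assumption: the block is single-quoted and contains no nested quotes.
    """
    block_re = re.compile(r"(worker_ckpts=')(.*?)(')", re.DOTALL)

    def swap_hosts(match):
        body = match.group(2)
        if host_map:
            # Longest names first so "node10" is not matched as "node1".
            names = sorted(host_map, key=len, reverse=True)
            # Only match a name that is not glued to other word characters,
            # dots, or dashes, so checkpoint file names are left alone.
            host_re = re.compile(
                r"(?<![\w.-])(" + "|".join(map(re.escape, names)) + r")(?![\w.-])"
            )
            # One pass over the text, so a swap (a->b, b->a) does not chain.
            body = host_re.sub(lambda m: host_map[m.group(1)], body)
        return match.group(1) + body + match.group(3)

    new_text, count = block_re.subn(swap_hosts, script_text, count=1)
    if count == 0:
        raise ValueError("no worker_ckpts='...' block found")
    return new_text
```

To use it, read dmtcp_restart_script.sh into a string, call `remap_hosts` with the old-to-new host mapping for the new allocation, and write the result back before launching the restart. Text outside the `worker_ckpts` block is left untouched.
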
Thanks for getting back to me. I'd love to help you debug this at your
convenience, but will be using the above script until we can sort these
errors out, or until the checkpoint/restarts prove error-prone due to my
Python script's incompleteness.
The header change that you suggested results in the following error:
[40000] ERROR at connectionmessage.h:63 in assertValid;
REASON='JASSERT(strcmp(sign, HANDSHAKE_SIGNATURE_MSG) == 0) failed'
sign =
Message: read invalid message, signature mismatch. (External socket?)
python2.7 (40000): Terminating...
On Tue, Apr 26, 2016 at 6:51 PM, Rohan Garg <rohg...@ccs.neu.edu> wrote:
> Hi William,
>
> Does the directory specified in the error message (`savedFilePath`)
> exist? If yes, then could you please retry after defining the
> `STAMPEDE_LUSTRE_FIX` macro in src/util_misc.cpp? Add `#define
> STAMPEDE_LUSTRE_FIX 1` at the top of the file (perhaps after all
> the includes). We saw a related issue on Stampede; I wonder if what
> you are seeing is the same thing.
>
> Thanks,
> Rohan
>
> On Tue, Apr 26, 2016 at 11:34:41AM -0700, William Fox wrote:
> > I am running DMTCP on an HPC system and am hoping to run a distributed
> > application across several nodes and checkpoint/restart it.
> >
> > When I run with the --rm plugin in the Torque environment, the application
> > runs into an error (I believe directly after dmtcp_launch --rm):
> > [40000] ERROR at fileconnection.cpp:619 in preCkpt;
> > REASON='JASSERT(Util::createDirectoryTree(savedFilePath)) failed'
> > savedFilePath = /oasis/scratch/<checkpoint_directory>
> > Message: Unable to create directory in File Path
> > python2.7 (40000): Terminating...
> >
> > If I run without --rm, then the host names are not adapted on a restart
> > and the application fails.
> >
> > Running a helloworld counting program that says the host files every 60
> > seconds via openmpi runs smoothly with the --rm and --infiniband plugins
> > on the same system. I have tried to track down similar errors in the
> > forums but failed to find instances of distributed systems within the HPC
> > environment. The checkpoints occur in the same directory for both the
> > openmpi and distributed application, so it is not the folder permissions.
> >
> > Any ideas?
>
> >
> > _______________________________________________
> > Dmtcp-forum mailing list
> > Dmtcp-forum@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
>
>