Hi Jonathan,
Thanks for writing to us. We're glad to help with this problem. Could you
provide the following information?
What is the cluster's interconnect: InfiniBand, TCP, or something else?
What versions of Slurm and MPI do you use?
Aside from the failed jobs, do the remaining jobs complete successfully? Can
they checkpoint and restart successfully?
The log you sent is quite general: it shows only that the client could not
connect to the coordinator. There are several possible causes for that, so
we'll need to dig further.
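In the meantime, if it helps to check whether the ~100 failures all hit the
same assertion, a rough sketch like the one below could tally the ERROR lines
across the job logs. It assumes the logs use the same line format as the
excerpt you sent; the function name tally_dmtcp_errors and the sample text are
purely illustrative and are not part of DMTCP.

```python
import re
from collections import Counter

# Matches lines shaped like the excerpt in the report, e.g.:
#   [47000] ERROR at dmtcpmessagetypes.cpp:65 in assertValid;
# Assumption: the bracketed number identifies the process; it is not used here.
ERROR_RE = re.compile(r"\[(\d+)\] ERROR at ([\w.]+):(\d+) in (\w+);")


def tally_dmtcp_errors(log_text):
    """Count ERROR lines by (source file, line number, function name)."""
    counts = Counter()
    for match in ERROR_RE.finditer(log_text):
        _ident, source, line, func = match.groups()
        counts[(source, int(line), func)] += 1
    return counts


# Sample text copied from the error excerpt in the original report.
sample = """\
[47000] ERROR at dmtcpmessagetypes.cpp:65 in assertValid;
REASON='JASSERT(strcmp ( DMTCP_MAGIC_STRING,_magicBits ) == 0) failed'
[40000] ERROR at coordinatorapi.cpp:601 in createNewConnectionBeforeFork;
REASON='JASSERT(_coordinatorSocket.isValid()) failed'
"""

for (source, line, func), n in tally_dmtcp_errors(sample).most_common():
    print(f"{n:4d}  {source}:{line} in {func}")
```

Running it over the concatenated logs of the failed jobs would show whether
they all stop at the same two assertions or at different ones.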
Best,
Jiajun
On Tue, Jun 28, 2016 at 11:45 AM, Jonathan Patterson <j...@astro.ox.ac.uk>
wrote:
>
> Hello!
> I'm running v. 2.4.4 on CentOS 6.8, kernel
> 2.6.32-431.20.3.el6.x86_64
> This is a cluster, with ~ 100 compute nodes, running slurm.
> Jobs are started with dmtcp_launch --rm. The idea is that jobs can
> be checkpointed as needed, to move them around between machines to fit jobs
> together to make room for high memory/specific MPI geometry jobs. This has
> worked well, but...
> Out of ~ 45,000 jobs that have run so far, ~ 100 have errors as
> below. I cannot find a common compute node, time, job type, user, memory
> usage, or any other factor - it seems that dmtcp is just randomly
> generating this error. This stops the job, which is a bit of a problem. No
> checkpointing was attempted on these jobs.
> Any ideas where I should look for the problem, anybody? Anything I
> can do to get some more debugging info? Is it the coordinator, or the dmtcp
> library wrapped around the running program that's generating this error?
> Thanks in advance...
>
> [47000] ERROR at dmtcpmessagetypes.cpp:65 in assertValid;
> REASON='JASSERT(strcmp ( DMTCP_MAGIC_STRING,_magicBits ) == 0) failed'
> _magicBits =
> Message: read invalid message, _magicBits mismatch. Did DMTCP coordinator
> die uncleanly?
> main-PYTHIA8-lhef (47000): Terminating...
> [40000] ERROR at coordinatorapi.cpp:601 in createNewConnectionBeforeFork;
> REASON='JASSERT(_coordinatorSocket.isValid()) failed'
> bash (40000): Terminating...
>
> _______________________________________________
> Dmtcp-forum mailing list
> Dmtcp-forum@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum