Just a point that I have run into (not a developer here): if your main
process is connected to clients and the main process ends, that prompts
the job to be cleaned up in HPC. This is often the case for me and
results in jobs dying uncleanly because of straggler clients that have
yet to kill themselves. I hope this helps, but regardless I am also
interested in the resolution reached here.
On Jun 29, 2016 6:54 AM, "Jonathan Patterson" <j...@astro.ox.ac.uk> wrote:
>
> I've changed dmtcpworker.cpp so that if it gets an invalid message from
> the coordinator, it just quits the worker thread. This allows the actual
> program to continue running. So I've worked around the problem now, but
> have no idea why dmtcp_coordinator was (apparently) quitting. It wasn't
> being told to by Slurm.
> Given that I'm running a live system here, and the error is
> intermittent, I don't think we'll find the cause, and my hack
> effectively fixes it for me.
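>
> A minimal sketch of that pattern (illustrative only - the message type
> and read helper below are made up, not the actual DMTCP code): on a bad
> magic string, log a warning and exit only the worker thread via
> pthread_exit(), so the checkpointed program itself keeps running.
>
>     // Illustrative sketch, not DMTCP source: exit only the listener
>     // thread on a bad coordinator message; the process lives on.
>     #include <pthread.h>
>     #include <cstring>
>     #include <cstdio>
>
>     static const char MAGIC[8] = "DMTCP";  // stand-in for DMTCP_MAGIC_STRING
>
>     struct Message { char magicBits[8]; };
>
>     // Stub for whatever reads the next coordinator message off the socket.
>     static bool readMessage(Message* m) { memset(m, 0, sizeof *m); return true; }
>
>     static void* workerThread(void*) {
>         Message msg;
>         while (readMessage(&msg)) {
>             if (strcmp(MAGIC, msg.magicBits) != 0) {
>                 fprintf(stderr, "invalid coordinator message; "
>                                 "quitting worker thread\n");
>                 pthread_exit(nullptr);     // terminates this thread only;
>             }                              // the user program keeps running
>             // ... dispatch the message ...
>         }
>         return nullptr;
>     }
>
>     int main() {
>         pthread_t t;
>         pthread_create(&t, nullptr, workerThread, nullptr);
>         pthread_join(t, nullptr);          // main continues after the
>         return 0;                          // worker thread exits
>     }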
> I'd still like to work on the DMTCP_DL_PLUGIN messages, though. I
> couldn't find any documentation as to what effect disabling
> DMTCP_DL_PLUGIN would actually have.
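>
> For context, the warnings quoted below come from DMTCP's wrapper around
> dlopen (dlwrappers.cpp), so disabling the plugin presumably means those
> calls go to the real libdl unwrapped. A generic sketch of this kind of
> interposition - purely to illustrate the mechanism, not DMTCP's actual
> implementation:
>
>     // Generic LD_PRELOAD-style dlopen interposer (illustrative only).
>     // Build: g++ -shared -fPIC -o libdlshim.so dlshim.cpp -ldl
>     #ifndef _GNU_SOURCE
>     #define _GNU_SOURCE                    // for RTLD_NEXT
>     #endif
>     #include <dlfcn.h>
>     #include <cstdio>
>
>     extern "C" void* dlopen(const char* filename, int flag) {
>         // Forward to the next dlopen in the link chain (normally libdl's).
>         using dlopen_fn = void* (*)(const char*, int);
>         static dlopen_fn real_dlopen =
>             reinterpret_cast<dlopen_fn>(dlsym(RTLD_NEXT, "dlopen"));
>
>         void* ret = real_dlopen(filename, flag);
>         if (ret == nullptr)                // this is where a wrapper would
>             fprintf(stderr,                // emit a JWARNING-style message
>                     "dlopen(%s, %d) failed: %s\n",
>                     filename ? filename : "(null)", flag, dlerror());
>         return ret;
>     }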
>
>
> On 29-Jun-16 8:58 AM, Jonathan Patterson wrote:
> > ... and sorry, I forgot to answer your other question. Checkpointing
> > does work well.
> >
> > On 29 June 2016 08:08:49 BST, Jonathan Patterson <j...@astro.ox.ac.uk> wrote:
> >
> >
> > Great, thank you.
> > We're just using TCP over gigabit ethernet for the network.
> > Slurm is 15.08.6, but it's not doing the checkpointing. I'm doing
> > that manually. As far as Slurm is concerned, there is no checkpointing.
> > I'm not starting MPI jobs with dmtcp_launch - I'm not aiming to
> > checkpoint the MPI jobs. It's the 1-core, low-memory simple jobs that I
> > want to checkpoint, so these can be moved around to make way for the
> > more complex jobs. So I'm thinking we can leave MPI out of this.
> > Most jobs are running with no problems; it's just the dmtcp ones
> > that *occasionally* have a problem.
> > Some of the failing jobs (specific ones) complain about libdl.so
> > (see below), but not all of them, if that helps. Maybe we should deal
> > with that issue first?
> > The other failing jobs fail simply with the message I posted before.
> >
> > [43000] WARNING at dlwrappers.cpp:75 in dlopen; REASON='JWARNING(ret) failed'
> > filename = libirml.so.1
> > flag = 1
> > Message: dlopen failed. You may also see a message 'ERROR: ld.so:'
> > from libdl.so. If this happens only under DMTCP, then consider setting
> > the environment variable DMTCP_DL_PLUGIN to "0" before 'dmtcp_launch'.
> > If the problem persists, please write to the DMTCP developers.
> >
> > [43000] NOTE at processinfo.cpp:199 in growStack; REASON='bottom-most
> > page of stack (page with highest address) was invisible in
> > /proc/self/maps. It is made visible again now.'
> >
> > [43000] WARNING at dlwrappers.cpp:75 in dlopen; REASON='JWARNING(ret) failed'
> > filename = libcilkrts.so
> > flag = 1
> > Message: dlopen failed. You may also see a message 'ERROR: ld.so:'
> > from libdl.so. If this happens only under DMTCP, then consider setting
> > the environment variable DMTCP_DL_PLUGIN to "0" before 'dmtcp_launch'.
> > If the problem persists, please write to the DMTCP developers.
> >
> > [43000] ERROR at dmtcpmessagetypes.cpp:65 in assertValid;
> > REASON='JASSERT(strcmp ( DMTCP_MAGIC_STRING,_magicBits ) == 0) failed'
> > _magicBits =
> > Message: read invalid message, _magicBits mismatch. Did DMTCP
> > coordinator die uncleanly?
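> >
> > (A plausible reconstruction of how _magicBits ends up blank - a
> > sketch under assumptions, not DMTCP source: if the coordinator's end
> > of the socket has died, read() returns 0 or a short read, the
> > zero-initialized magic field stays empty, and the strcmp JASSERT
> > fires with "_magicBits =" printed blank.)
> >
> >     // Sketch (assumed behavior): reading a message from a socket whose
> >     // peer has died leaves the zeroed magic field empty, failing strcmp.
> >     #include <unistd.h>
> >     #include <cstring>
> >     #include <cstdio>
> >
> >     struct CoordMessage { char magicBits[16]; /* type, payload, ... */ };
> >
> >     bool receiveValid(int sock, CoordMessage* msg) {
> >         memset(msg, 0, sizeof *msg);
> >         ssize_t n = read(sock, msg, sizeof *msg);   // 0 on EOF: the
> >         if (n <= 0) return false;                   // peer closed early
> >         // A short or corrupt read leaves magicBits empty or wrong,
> >         // matching the blank "_magicBits =" in the error above.
> >         return strcmp("DMTCP_MAGIC", msg->magicBits) == 0;  // stand-in
> >     }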
> >
> >
> >
> > On 28/06/16 22:59, Jiajun Cao wrote:
> >
> > Hi Jonathan,
> >
> > Thanks for writing to us. We're definitely glad to help you with the
> > problem. Can you provide us with the following info:
> >
> > What's the interconnect of the cluster, InfiniBand, TCP?
> >
> > What versions of Slurm and MPI do you use?
> >
> > Aside from the failing jobs, are the remaining jobs successful? Can
> > they checkpoint/restart successfully?
> >
> > The log you sent is very general: it tells us only that the client
> > somehow cannot connect to the coordinator. There can be various
> > reasons for that. We'll need to dig further.
> >
> > Best,
> > Jiajun
> >
> > On Tue, Jun 28, 2016 at 11:45 AM, Jonathan Patterson
> > <j...@astro.ox.ac.uk> wrote:
> >
> >
> > Hello!
> > I'm running v. 2.4.4 on CentOS 6.8, kernel 2.6.32-431.20.3.el6.x86_64.
> > This is a cluster with ~100 compute nodes, running Slurm.
> > Jobs are started with dmtcp_launch --rm. The idea is that jobs can
> > be checkpointed as needed, to move them around between machines to
> > fit jobs together and make room for high-memory / specific-MPI-geometry
> > jobs. This has worked well, but...
> > Out of ~45,000 jobs that have run so far, ~100 have failed with
> > errors as below. I cannot find a common compute node, time, job type,
> > user, memory usage, or any other factor - it seems that dmtcp is just
> > randomly generating this error. This stops the job, which is a bit of
> > a problem. No checkpointing was attempted on these jobs.
> > Any ideas where I should look for the problem, anybody? Anything I
> > can do to get some more debugging info? Is it the coordinator, or the
> > dmtcp library wrapped around the running program that's generating
> > this error?
> > Thanks in advance...
> >
> > [47000] ERROR at dmtcpmessagetypes.cpp:65 in assertValid;
> > REASON='JASSERT(strcmp ( DMTCP_MAGIC_STRING,_magicBits ) == 0) failed'
> > _magicBits =
> > Message: read invalid message, _magicBits mismatch. Did DMTCP
> > coordinator die uncleanly?
> > main-PYTHIA8-lhef (47000): Terminating...
> > [40000] ERROR at coordinatorapi.cpp:601 in createNewConnectionBeforeFork;
> > REASON='JASSERT(_coordinatorSocket.isValid()) failed'
> > bash (40000): Terminating...