I've changed dmtcpworker.cpp so that if it gets an invalid message from
the coordinator, it just quits the worker thread rather than aborting.
This allows the actual program to continue running. So I've worked
around the problem now, but I have no idea why dmtcp_coordinator was
(apparently) quitting; it wasn't being told to by Slurm.
        Given that I'm running a live system here and the error is
intermittent, I don't think we'll find the cause, and my hack
effectively fixes it for me.
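
For the record, the change amounts to roughly the sketch below. This is
from memory, and the names (isValid(), the receive call) are illustrative
rather than the exact DMTCP symbols:

    // In the checkpoint thread's coordinator-message loop
    // (dmtcpworker.cpp); needs <pthread.h> for pthread_exit():
    dmtcp::DmtcpMessage msg;
    CoordinatorAPI::instance().recvMsgFromCoordinator(&msg);
    if (!msg.isValid()) {   // was msg.assertValid(), whose JASSERT
                            // terminated the whole process
      JNOTE("invalid message from coordinator; quitting worker thread");
      pthread_exit(NULL);   // quit only this thread; the user's program
                            // keeps running (though now un-checkpointable)
    }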
        I'd still like to work on the DMTCP_DL_PLUGIN messages, though. I
couldn't find any documentation on what effect disabling DMTCP_DL_PLUGIN
would actually have.
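
Judging by the warning text and the file it comes from (dlwrappers.cpp),
my guess is that the plugin wraps dlopen()/dlsym(), and that setting
DMTCP_DL_PLUGIN=0 before dmtcp_launch simply bypasses that wrapping, so
the calls go straight through to libdl. Conceptually something like the
sketch below - illustrative only, not DMTCP's actual code (NEXT_FNC is
DMTCP's next-function-in-chain macro from dmtcp.h; in reality the toggle
is presumably handled at launch time rather than per call):

    #include <stdlib.h>   // getenv
    #include <string.h>   // strcmp
    #include "dmtcp.h"    // NEXT_FNC

    // Hypothetical pass-through dlopen wrapper when the plugin is off:
    extern "C" void *dlopen(const char *filename, int flag)
    {
      const char *env = getenv("DMTCP_DL_PLUGIN");
      if (env != NULL && strcmp(env, "0") == 0) {
        return NEXT_FNC(dlopen)(filename, flag);  // behave as plain libdl
      }
      // ... DMTCP's wrapped path, which is what emits the JWARNINGs
      // quoted below ...
      return NEXT_FNC(dlopen)(filename, flag);
    }

If that's right, disabling it would trade whatever bookkeeping DMTCP does
for dlopen'd libraries for stock libdl behaviour, which I'd assume
matters at checkpoint/restart time - but I'd appreciate confirmation.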


On 29-Jun-16 8:58 AM, Jonathan Patterson wrote:
> ... and sorry, I forgot to answer your other question. Checkpointing
> does work well.
>
> On 29 June 2016 08:08:49 BST, Jonathan Patterson <j...@astro.ox.ac.uk> wrote:
>
>
>     Great, thank you.
>     We're just using TCP over gigabit ethernet for the network.
>     Slurm is 15.08.6, but it's not doing the checkpointing; I'm doing
>     that manually. As far as Slurm is concerned, there is no checkpointing.
>     I'm not starting MPI jobs with dmtcp_launch - I'm not aiming to
>     checkpoint the MPI jobs. It's the simple 1-core, low-memory jobs that
>     I want to checkpoint, so they can be moved around to make way for the
>     more complex jobs. So I'm thinking we can leave MPI out of this.
>     Most jobs are running with no problems; it's just the dmtcp ones that
>     *occasionally* have a problem.
>     Some of the failing jobs (specific ones) complain about libdl.so
>     (see below), but not all of them, if that helps. Maybe we should deal
>     with that issue first?
>     The other failing jobs fail simply with the message I posted before.
>
>     [43000] WARNING at dlwrappers.cpp:75 in dlopen; REASON='JWARNING(ret) failed'
>          filename = libirml.so.1
>          flag = 1
>     Message: dlopen failed.  You may also see a message 'ERROR: ld.so:'
>     from libdl.so.  If this happens only under DMTCP, then consider setting
>     the environment variable DMTCP_DL_PLUGIN to "0" before 'dmtcp_launch'.
>     If the problem persists, please write to the DMTCP developers.
>
>     [43000] NOTE at processinfo.cpp:199 in growStack; REASON='bottom-most
>     page of stack (page with highest address) was invisible in
>     /proc/self/maps. It is made visible again now.'
>     [43000] WARNING at dlwrappers.cpp:75 in dlopen; REASON='JWARNING(ret) failed'
>          filename = libcilkrts.so
>          flag = 1
>     Message: dlopen failed.  You may also see a message 'ERROR: ld.so:'
>     from libdl.so.  If this happens only under DMTCP, then consider setting
>     the environment variable DMTCP_DL_PLUGIN to "0" before 'dmtcp_launch'.
>     If the problem persists, please write to the DMTCP developers.
>
>     [43000] ERROR at dmtcpmessagetypes.cpp:65 in assertValid;
>     REASON='JASSERT(strcmp ( DMTCP_MAGIC_STRING,_magicBits ) == 0) failed'
>          _magicBits =
>     Message: read invalid message, _magicBits mismatch.  Did DMTCP
>     coordinator die uncleanly?
>
>
>
>     On 28/06/16 22:59, Jiajun Cao wrote:
>
>         Hi Jonathan,
>
>         Thanks for writing to us. We're definitely glad to help you with
>         the problem. Can you provide us with the following info:
>
>         What's the interconnect of the cluster: InfiniBand, TCP?
>
>         What versions of Slurm and MPI do you use?
>
>         Aside from the failing jobs, are the remaining jobs successful?
>         Can they checkpoint/restart successfully?
>
>         The log you sent is very general: it tells only that the client
>         cannot connect to the coordinator somehow. There can be various
>         reasons for that. We'll need to dig further.
>
>         Best,
>         Jiajun
>
>         On Tue, Jun 28, 2016 at 11:45 AM, Jonathan Patterson
>         <j...@astro.ox.ac.uk> wrote:
>
>
>         Hello!
>         I'm running v. 2.4.4 on CentOS 6.8, kernel 2.6.32-431.20.3.el6.x86_64.
>         This is a cluster with ~100 compute nodes, running Slurm.
>         Jobs are started with dmtcp_launch --rm. The idea is that jobs
>         can be checkpointed as needed, to move them around between
>         machines and fit jobs together, making room for high-memory or
>         specific-MPI-geometry jobs. This has worked well, but...
>         Out of ~45,000 jobs that have run so far, ~100 have errors as
>         below. I cannot find a common compute node, time, job type, user,
>         memory usage, or any other factor - it seems that dmtcp is just
>         randomly generating this error. This stops the job, which is a
>         bit of a problem. No checkpointing was attempted on these jobs.
>         Any ideas where I should look for the problem, anybody? Anything
>         I can do to get some more debugging info? Is it the coordinator,
>         or the dmtcp library wrapped around the running program that's
>         generating this error?
>         Thanks in advance...
>
>         [47000] ERROR at dmtcpmessagetypes.cpp:65 in assertValid;
>         REASON='JASSERT(strcmp ( DMTCP_MAGIC_STRING,_magicBits ) == 0) failed'
>         _magicBits =
>         Message: read invalid message, _magicBits mismatch. Did DMTCP
>         coordinator die uncleanly?
>         main-PYTHIA8-lhef (47000): Terminating...
>         [40000] ERROR at coordinatorapi.cpp:601 in createNewConnectionBeforeFork;
>         REASON='JASSERT(_coordinatorSocket.isValid()) failed'
>         bash (40000): Terminating...