... and sorry, I forgot to answer your other question. Checkpointing does work 
well.

On 29 June 2016 08:08:49 BST, Jonathan Patterson <j...@astro.ox.ac.uk> wrote:
>
>Great, thank you.
>We're just using TCP over gigabit ethernet for the network.
>Slurm is 15.08.6, but it's not doing the checkpointing. I'm doing that
>manually. As far as slurm is concerned, there is no checkpointing.
>I'm not starting MPI jobs with dmtcp_launch - I'm not aiming to
>checkpoint the MPI jobs, it's the 1-core, low-memory simple jobs that I
>want to checkpoint, so these can be moved around to make way for the
>more complex jobs. So I'm thinking we can leave MPI out of this.
>Most jobs are running with no problems, it's just the dmtcp ones that
>*occasionally* have a problem.
>Some of the failing jobs (specific ones) complain about libdl.so (see
>below), but not all of them, if that helps. Maybe we should deal with
>that issue first?
>The other failing jobs fail simply with the message I posted before.
>
>[43000] WARNING at dlwrappers.cpp:75 in dlopen; REASON='JWARNING(ret)
>failed'
>     filename = libirml.so.1
>     flag = 1
>Message: dlopen failed.  You may also see a message 'ERROR: ld.so:'
>from libdl.so.  If this happens only under DMTCP, then consider setting
>the environment variable DMTCP_DL_PLUGIN to "0" before 'dmtcp_launch'.
>If the problem persists, please write to the DMTCP developers.
>
>[43000] NOTE at processinfo.cpp:199 in growStack; REASON='bottom-most
>page of stack (page with highest address) was
>  invisible in /proc/self/maps. It is made visible again now.'
>[43000] WARNING at dlwrappers.cpp:75 in dlopen; REASON='JWARNING(ret)
>failed'
>     filename = libcilkrts.so
>     flag = 1
>Message: dlopen failed.  You may also see a message 'ERROR: ld.so:'
>from libdl.so.  If this happens only under DMTCP, then consider setting
>the environment variable DMTCP_DL_PLUGIN to "0" before 'dmtcp_launch'.
>If the problem persists, please write to the DMTCP developers.
>
>[43000] ERROR at dmtcpmessagetypes.cpp:65 in assertValid;
>REASON='JASSERT(strcmp ( DMTCP_MAGIC_STRING,_magicBits ) == 0) failed'
>     _magicBits = 
>Message: read invalid message, _magicBits mismatch.  Did DMTCP
>coordinator die uncleanly?
>
>
>
>On 28/06/16 22:59, Jiajun Cao wrote:
>> Hi Jonathan,
>> 
>> Thanks for writing to us. We're definitely glad to help you with the 
>> problem. Can you provide us the following info:
>> 
>> What's the interconnect of the cluster, InfiniBand, TCP?
>> 
>> What versions of Slurm and MPI do you use?
>> 
>> Aside from the failing jobs, are the remaining jobs successful? Can
>> they checkpoint/restart successfully?
>> 
>> The log you sent is very general: it tells only that the client
>> cannot connect to the coordinator somehow. There can be various
>> reasons for that. We'll need to dig further.
>> 
>> Best,
>> Jiajun
>> 
>> On Tue, Jun 28, 2016 at 11:45 AM, Jonathan Patterson
>> <j...@astro.ox.ac.uk> wrote:
>> 
>> 
>>     Hello!
>>     I'm running v. 2.4.4 on CentOS 6.8, kernel
>>     2.6.32-431.20.3.el6.x86_64
>>     This is a cluster, with ~ 100 compute nodes, running slurm.
>>     Jobs are started with dmtcp_launch --rm. The idea is that
>>     jobs can be checkpointed as needed, to move them around between
>>     machines to fit jobs together to make room for high
>>     memory/specific MPI geometry jobs. This has worked well, but...
>>     Out of ~ 45,000 jobs that have run so far, ~ 100 have
>>     errors as below. I cannot find a common compute node, time, job
>>     type, user, memory usage, or any other factor - it seems that
>>     dmtcp is just randomly generating this error. This stops the job,
>>     which is a bit of a problem. No checkpointing was attempted on
>>     these jobs.
>>     Any ideas where I should look for the problem, anybody?
>>     Anything I can do to get some more debugging info? Is it the
>>     coordinator, or the dmtcp library wrapped around the running
>>     program that's generating this error?
>>     Thanks in advance...
>> 
>>     [47000] ERROR at dmtcpmessagetypes.cpp:65 in assertValid;
>>     REASON='JASSERT(strcmp ( DMTCP_MAGIC_STRING,_magicBits ) == 0) failed'
>>           _magicBits =
>>     Message: read invalid message, _magicBits mismatch.  Did DMTCP
>>     coordinator die uncleanly?
>>     main-PYTHIA8-lhef (47000): Terminating...
>>     [40000] ERROR at coordinatorapi.cpp:601 in
>>     createNewConnectionBeforeFork;
>>     REASON='JASSERT(_coordinatorSocket.isValid()) failed'
>>     bash (40000): Terminating...
>> 
>> 
>>     ------------------------------------------------------------------------------
>>     Attend Shape: An AT&T Tech Expo July 15-16. Meet us at AT&T Park in San
>>     Francisco, CA to explore cutting-edge tech and listen to tech luminaries
>>     present their vision of the future. This family event has something for
>>     everyone, including kids. Get more information and register today.
>>     http://sdm.link/attshape
>>     _______________________________________________
>>     Dmtcp-forum mailing list
>>     Dmtcp-forum@lists.sourceforge.net
>>     <mailto:Dmtcp-forum@lists.sourceforge.net>
>>     https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
>> 
>> 
>
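The DMTCP messages quoted in this thread all start with a header of the form `[pid] LEVEL at file.cpp:line in function`. As a rough sketch for the "where should I look" question, something like the following could tally which source locations show up across the collected job logs, so the failing jobs can be grouped by message site. The function name `tally_dmtcp_messages` and the sample text are made up for illustration; the regular expression is inferred only from the excerpts above and may not cover every DMTCP message format.

```python
import re
from collections import Counter

# Header shape inferred from the log excerpts in this thread, e.g.
#   [43000] WARNING at dlwrappers.cpp:75 in dlopen; REASON='...
# This is an assumption, not a documented DMTCP log grammar.
HEADER = re.compile(
    r"\[(?P<pid>\d+)\]\s+(?P<level>ERROR|WARNING|NOTE)\s+at\s+"
    r"(?P<file>[\w.]+):(?P<line>\d+)"
)


def tally_dmtcp_messages(log_text):
    """Count DMTCP messages by (level, source file, line number)."""
    counts = Counter()
    for raw in log_text.splitlines():
        # Drop mail-quoting markers and indentation, if any.
        line = raw.lstrip("> \t")
        match = HEADER.match(line)
        if match:
            key = (match["level"], match["file"], int(match["line"]))
            counts[key] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical input: header lines copied from the messages above.
    sample = """\
[43000] WARNING at dlwrappers.cpp:75 in dlopen; REASON='JWARNING(ret)
[43000] NOTE at processinfo.cpp:199 in growStack; REASON='bottom-most
[43000] WARNING at dlwrappers.cpp:75 in dlopen; REASON='JWARNING(ret)
[43000] ERROR at dmtcpmessagetypes.cpp:65 in assertValid;
"""
    for (level, src, lineno), n in tally_dmtcp_messages(sample).most_common():
        print(f"{n:4d}  {level:<7}  {src}:{lineno}")
```

Running this over the concatenated stderr of the failed jobs (and, for comparison, a sample of successful ones) would show whether the `dlopen` warnings and the `_magicBits` error tend to occur together or independently.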
