This thread appears to have died, so in the hope of getting an answer 
from one of the developers, here's the basic question again:
        When told to "consider setting the environment variable DMTCP_DL_PLUGIN 
to 0", what are the implications of doing so? So far I've seen the error with 
every MATLAB job that people have tried to run. Does this error mean the 
program will not run correctly? Does disabling the DL_PLUGIN (whatever that is) 
mean that checkpointing will not work?
        Some guidance please....
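
For anyone who wants to try the suggestion from the warning on a single test job, here is a minimal Python sketch of what "setting DMTCP_DL_PLUGIN to 0 before dmtcp_launch" amounts to mechanically. Only the variable name, its value, and the `dmtcp_launch --rm` invocation come from this thread; the helper names `build_dmtcp_launch` and `launch` and the job command `./my_job` are made up for illustration. It says nothing about whether checkpointing still works with the plugin disabled, which is the open question above.

```python
import os
import subprocess


def build_dmtcp_launch(job_cmd, disable_dl_plugin=True, base_env=None):
    """Return (argv, env) for starting job_cmd under dmtcp_launch.

    Hypothetical helper for illustration, not part of DMTCP.
    """
    # Copy the environment so the caller's environment is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    if disable_dl_plugin:
        # The value "0" is the one suggested by the dlopen warning text.
        env["DMTCP_DL_PLUGIN"] = "0"
    # "--rm" is the option already used for jobs on this cluster.
    argv = ["dmtcp_launch", "--rm"] + list(job_cmd)
    return argv, env


def launch(job_cmd, disable_dl_plugin=True):
    """Run the job; needs dmtcp_launch on PATH (assumed, not checked here)."""
    argv, env = build_dmtcp_launch(job_cmd, disable_dl_plugin)
    return subprocess.run(argv, env=env, check=True)


if __name__ == "__main__":
    # "./my_job" is a placeholder for the real job command.
    argv, env = build_dmtcp_launch(["./my_job"])
    print("DMTCP_DL_PLUGIN=" + env["DMTCP_DL_PLUGIN"], " ".join(argv))
```

Running the same failing job once with `disable_dl_plugin=True` and once with `False` would at least show whether the dlopen warnings depend on the variable.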

On 29-Jun-16 8:08 AM, Jonathan Patterson wrote:
>
> Great, thank you.
> We're just using TCP over gigabit ethernet for the network.
> Slurm is 15.08.6, but it's not doing the checkpointing. I'm doing that 
> manually. As far as Slurm is concerned, there is no checkpointing.
> I'm not starting MPI jobs with dmtcp_launch - I'm not aiming to checkpoint 
> the MPI jobs, it's the 1-core, low-memory simple jobs that I want to 
> checkpoint, so these can be moved around to make way for the more complex 
> jobs. So I'm thinking we can leave MPI out of this.
> Most jobs are running with no problems, it's just the dmtcp ones that 
> *occasionally* have a problem.
> Some of the failing jobs (specific ones) complain about libdl.so (see below), 
> but not all of them, if that helps. Maybe we should deal with that issue 
> first?
> The other failing jobs fail simply with the message I posted before.
>
> [43000] WARNING at dlwrappers.cpp:75 in dlopen; REASON='JWARNING(ret) failed'
>      filename = libirml.so.1
>      flag = 1
> Message: dlopen failed.  You may also see a message 'ERROR: ld.so:'
> from libdl.so.  If this happens only under DMTCP, then consider setting
> the environment variable DMTCP_DL_PLUGIN to "0" before 'dmtcp_launch'.
> If the problem persists, please write to the DMTCP developers.
>
> [43000] NOTE at processinfo.cpp:199 in growStack; REASON='bottom-most page of 
> stack (page with highest address) was
>   invisible in /proc/self/maps. It is made visible again now.'
> [43000] WARNING at dlwrappers.cpp:75 in dlopen; REASON='JWARNING(ret) failed'
>      filename = libcilkrts.so
>      flag = 1
> Message: dlopen failed.  You may also see a message 'ERROR: ld.so:'
> from libdl.so.  If this happens only under DMTCP, then consider setting
> the environment variable DMTCP_DL_PLUGIN to "0" before 'dmtcp_launch'.
> If the problem persists, please write to the DMTCP developers.
>
> [43000] ERROR at dmtcpmessagetypes.cpp:65 in assertValid; 
> REASON='JASSERT(strcmp ( DMTCP_MAGIC_STRING,_magicBits ) == 0) failed'
>      _magicBits =
> Message: read invalid message, _magicBits mismatch.  Did DMTCP coordinator 
> die uncleanly?
>
>
>
> On 28/06/16 22:59, Jiajun Cao wrote:
>> Hi Jonathan,
>>
>> Thanks for writing to us. We're definitely glad to help you with the
>> problem. Can you provide us the following info:
>>
>> What's the interconnect of the cluster, InfiniBand, TCP?
>>
>> What versions of Slurm and MPI do you use?
>>
>> Aside from the failed jobs, are the remaining jobs successful? Can they
>> checkpoint/restart successfully?
>>
>> The log you sent is very general: it tells only that the client cannot
>> connect to the coordinator somehow. There can be various reasons for
>> that. We'll need to dig further.
>>
>> Best,
>> Jiajun
>>
>> On Tue, Jun 28, 2016 at 11:45 AM, Jonathan Patterson <j...@astro.ox.ac.uk
>> <mailto:j...@astro.ox.ac.uk>> wrote:
>>
>>
>>     Hello!
>>     I'm running v. 2.4.4 on CentOS 6.8, kernel
>>     2.6.32-431.20.3.el6.x86_64.
>>     This is a cluster with ~100 compute nodes, running Slurm.
>>     Jobs are started with dmtcp_launch --rm. The idea is that
>>     jobs can be checkpointed as needed, to move them around between
>>     machines to fit jobs together to make room for high memory/specific
>>     MPI geometry jobs. This has worked well, but...
>>     Out of ~45,000 jobs that have run so far, ~100 have
>>     errors as below. I cannot find a common compute node, time, job
>>     type, user, memory usage, or any other factor - it seems that DMTCP
>>     is just randomly generating this error. This stops the job, which is
>>     a bit of a problem. No checkpointing was attempted on these jobs.
>>     Any ideas where I should look for the problem, anybody?
>>     Anything I can do to get some more debugging info? Is it the
>>     coordinator, or the DMTCP library wrapped around the running program,
>>     that's generating this error?
>>     Thanks in advance...
>>
>>     [47000] ERROR at dmtcpmessagetypes.cpp:65 in assertValid;
>>     REASON='JASSERT(strcmp ( DMTCP_MAGIC_STRING,_magicBits ) == 0) failed'
>>           _magicBits =
>>     Message: read invalid message, _magicBits mismatch.  Did DMTCP
>>     coordinator die uncleanly?
>>     main-PYTHIA8-lhef (47000): Terminating...
>>     [40000] ERROR at coordinatorapi.cpp:601 in
>>     createNewConnectionBeforeFork;
>>     REASON='JASSERT(_coordinatorSocket.isValid()) failed'
>>     bash (40000): Terminating...
>>
>>
>>     
>> ------------------------------------------------------------------------------
>>     Attend Shape: An AT&T Tech Expo July 15-16. Meet us at AT&T Park in San
>>     Francisco, CA to explore cutting-edge tech and listen to tech luminaries
>>     present their vision of the future. This family event has something for
>>     everyone, including kids. Get more information and register today.
>>     http://sdm.link/attshape
>>     _______________________________________________
>>     Dmtcp-forum mailing list
>>     Dmtcp-forum@lists.sourceforge.net
>>     <mailto:Dmtcp-forum@lists.sourceforge.net>
>>     https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
>>
>>
>
