Ok great, thanks - it sounds perfectly safe to disable it the way we're 
using it here then.
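
A minimal sketch of what that looks like in a job script, assuming the variable only needs to be exported before dmtcp_launch starts (as the warning text says) and keeping the --rm option already used for these Slurm jobs. The job name ./my_job is a hypothetical placeholder, and the guard around dmtcp_launch is only there so the snippet does nothing on a machine without DMTCP installed:

```shell
#!/bin/sh
# Disable DMTCP's dl plugin, as suggested by the dlopen warning.
# The variable has to be in the environment before dmtcp_launch runs.
export DMTCP_DL_PLUGIN=0

if command -v dmtcp_launch >/dev/null 2>&1; then
    # ./my_job is a placeholder for the real program.
    dmtcp_launch --rm ./my_job
else
    # Harmless fallback when DMTCP is not installed on this machine.
    echo "dmtcp_launch not found in PATH (DMTCP_DL_PLUGIN=$DMTCP_DL_PLUGIN)" >&2
fi
```

Since the plugin only guards against checkpoints taken during dlopen()/dlclose(), this should be fine as long as no checkpoint is requested while the job is still loading its shared libraries.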

On 01/07/16 19:34, Jiajun Cao wrote:
> Hi Jonathan,
>
> Setting the environment variable to 0 disables the dl plugin of DMTCP.
> The plugin's main job is to prevent a checkpoint from being taken in the
> middle of dlopen()/dlclose(), because a checkpoint taken at that point may
> cause undefined behavior on restart. Disabling the dl plugin should be okay
> for most applications, since most programs load their shared libraries
> during initialization. As long as you do not checkpoint the app while it is
> still loading libraries, it's safe.
>
> Having said that, I'm still not sure why it fails to open the shared
> library, since dmtcp does nothing special about the dl calls, except
> what I described above.
>
> Let me know if you have any questions,
> Jiajun
>
> On Fri, Jul 1, 2016 at 12:01 PM, Jonathan Patterson <j...@astro.ox.ac.uk> wrote:
>
>
>     This thread appears to have died, so in the hope of getting an answer
>     from one of the developers, here's the basic question again:
>     When told to "consider setting the environment variable DMTCP_DL_PLUGIN
>     to 0", what are the implications of doing this? I've seen the error with
>     every matlab job that people try to run so far. Does this error mean the
>     program will not run ok? Does disabling the DL_PLUGIN (whatever that is)
>     mean that checkpointing will not work?
>     Some guidance please....
>
>     On 29-Jun-16 8:08 AM, Jonathan Patterson wrote:
>      >
>      > Great, thank you.
>      > We're just using TCP over gigabit ethernet for the network.
>      > Slurm is 15.08.6, but it's not doing the checkpointing. I'm doing
>      > that manually. As far as slurm is concerned, there is no checkpointing.
>      > I'm not starting MPI jobs with dmtcp_launch - I'm not aiming to
>      > checkpoint the MPI jobs, it's the 1-core, low-memory simple jobs
>      > that I want to checkpoint, so these can be moved around to make way
>      > for the more complex jobs. So I'm thinking we can leave MPI out of this.
>      > Most jobs are running with no problems, it's just the dmtcp ones
>      > that *occasionally* have a problem.
>      > Some of the failing jobs (specific ones) complain about libdl.so
>      > (see below), but not all of them, if that helps. Maybe we should
>      > deal with that issue first?
>      > The other failing jobs fail simply with the message I posted before.
>      >
>      > [43000] WARNING at dlwrappers.cpp:75 in dlopen; REASON='JWARNING(ret) failed'
>      >      filename = libirml.so.1
>      >      flag = 1
>      > Message: dlopen failed.  You may also see a message 'ERROR: ld.so:'
>      > from libdl.so.  If this happens only under DMTCP, then consider setting
>      > the environment variable DMTCP_DL_PLUGIN to "0" before 'dmtcp_launch'.
>      > If the problem persists, please write to the DMTCP developers.
>      >
>      > [43000] NOTE at processinfo.cpp:199 in growStack;
>      > REASON='bottom-most page of stack (page with highest address) was
>      >   invisible in /proc/self/maps. It is made visible again now.'
>      > [43000] WARNING at dlwrappers.cpp:75 in dlopen; REASON='JWARNING(ret) failed'
>      >      filename = libcilkrts.so
>      >      flag = 1
>      > Message: dlopen failed.  You may also see a message 'ERROR: ld.so:'
>      > from libdl.so.  If this happens only under DMTCP, then consider setting
>      > the environment variable DMTCP_DL_PLUGIN to "0" before 'dmtcp_launch'.
>      > If the problem persists, please write to the DMTCP developers.
>      >
>      > [43000] ERROR at dmtcpmessagetypes.cpp:65 in assertValid;
>      > REASON='JASSERT(strcmp ( DMTCP_MAGIC_STRING,_magicBits ) == 0) failed'
>      >      _magicBits =
>      > Message: read invalid message, _magicBits mismatch.  Did DMTCP
>      > coordinator die uncleanly?
>      >
>      >
>      >
>      > On 28/06/16 22:59, Jiajun Cao wrote:
>      >> Hi Jonathan,
>      >>
>      >> Thanks for writing to us. We're definitely glad to help you with the
>      >> problem. Can you provide us with the following info:
>      >>
>      >> What's the interconnect of the cluster: InfiniBand or TCP?
>      >>
>      >> What versions of Slurm and MPI do you use?
>      >>
>      >> Aside from the failed jobs, are the remaining jobs successful? Can they
>      >> checkpoint/restart successfully?
>      >>
>      >> The log you sent is very general: it tells us only that the client cannot
>      >> connect to the coordinator somehow. There can be various reasons for
>      >> that. We'll need to dig further.
>      >>
>      >> Best,
>      >> Jiajun
>      >>
>      >> On Tue, Jun 28, 2016 at 11:45 AM, Jonathan Patterson <j...@astro.ox.ac.uk> wrote:
>      >>
>      >>
>      >>     Hello!
>      >>     I'm running v. 2.4.4 on CentOS 6.8, kernel
>      >>     2.6.32-431.20.3.el6.x86_64
>      >>     This is a cluster, with ~ 100 compute nodes, running slurm.
>      >>     Jobs are started with dmtcp_launch --rm. The idea is that
>      >>     jobs can be checkpointed as needed, to move them around between
>      >>     machines to fit jobs together to make room for high memory/specific
>      >>     MPI geometry jobs. This has worked well, but...
>      >>     Out of ~ 45,000 jobs that have run so far, ~ 100 have
>      >>     errors as below. I cannot find a common compute node, time, job
>      >>     type, user, memory usage, or any other factor - it seems that dmtcp
>      >>     is just randomly generating this error. This stops the job, which is
>      >>     a bit of a problem. No checkpointing was attempted on these jobs.
>      >>     Any ideas where I should look for the problem, anybody?
>      >>     Anything I can do to get some more debugging info? Is it the
>      >>     coordinator, or the dmtcp library wrapped around the running program
>      >>     that's generating this error?
>      >>     Thanks in advance...
>      >>
>      >>     [47000] ERROR at dmtcpmessagetypes.cpp:65 in assertValid;
>      >>     REASON='JASSERT(strcmp ( DMTCP_MAGIC_STRING,_magicBits ) == 0) failed'
>      >>           _magicBits =
>      >>     Message: read invalid message, _magicBits mismatch.  Did DMTCP
>      >>     coordinator die uncleanly?
>      >>     main-PYTHIA8-lhef (47000): Terminating...
>      >>     [40000] ERROR at coordinatorapi.cpp:601 in createNewConnectionBeforeFork;
>      >>     REASON='JASSERT(_coordinatorSocket.isValid()) failed'
>      >>     bash (40000): Terminating...
>      >>
>      >>
>      >>
>      >>     ------------------------------------------------------------------------------
>      >>     Attend Shape: An AT&T Tech Expo July 15-16. Meet us at AT&T Park in San
>      >>     Francisco, CA to explore cutting-edge tech and listen to tech luminaries
>      >>     present their vision of the future. This family event has something for
>      >>     everyone, including kids. Get more information and register today.
>      >>     http://sdm.link/attshape
>      >>     _______________________________________________
>      >>     Dmtcp-forum mailing list
>      >>     Dmtcp-forum@lists.sourceforge.net
>      >>     https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
>      >>
>      >>
>
>

