Ok great, thanks - it sounds perfectly safe to disable it the way we're using it here then.
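
For anyone finding this thread later, here is a minimal sketch of what "disabling it" looks like in a job script. The variable name DMTCP_DL_PLUGIN, the value "0", and the `dmtcp_launch --rm` invocation all come from the messages quoted below. The program name ./my_job is a hypothetical placeholder, and the PATH check is only there so the script fails gently on a machine without DMTCP.

```shell
#!/bin/sh
# Sketch only: disable the DMTCP dl plugin for this job, then launch under DMTCP.
# DMTCP_DL_PLUGIN must be set before dmtcp_launch starts, as the warning says.
DMTCP_DL_PLUGIN=0
export DMTCP_DL_PLUGIN

# ./my_job is a hypothetical placeholder for the real program.
JOB=./my_job

if command -v dmtcp_launch >/dev/null 2>&1; then
    # --rm is the flag already used for our Slurm jobs (see the original post).
    dmtcp_launch --rm "$JOB"
else
    echo "dmtcp_launch not found in PATH; nothing launched" >&2
fi
```

Per Jiajun's explanation, this should be fine as long as no checkpoint is requested while the program is in the middle of dlopen()/dlclose(), which for most programs means avoiding a checkpoint during startup.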
On 01/07/16 19:34, Jiajun Cao wrote:
> Hi Jonathan,
>
> Setting the environment variable disables the dl plugin of DMTCP.
> What the plugin mainly does is prevent checkpointing in the middle of
> dlopen()/dlclose(); a checkpoint taken at that point may cause undefined
> behavior on restart. Disabling the dl plugin should be okay for most
> applications, since most programs load their shared libraries during
> initialization. If you do not checkpoint the app at that point, it's safe.
>
> Having said that, I'm still not sure why it fails to open the shared
> library, since dmtcp does nothing special about the dl calls, except
> what I described above.
>
> Let me know if you have any questions,
> Jiajun
>
> On Fri, Jul 1, 2016 at 12:01 PM, Jonathan Patterson <j...@astro.ox.ac.uk> wrote:
>
>     This thread appears to have died, so in the hope of getting an answer
>     from one of the developers, here's the basic question again:
>     When told to "consider setting the environment variable DMTCP_DL_PLUGIN
>     to 0", what are the implications of doing this? I've seen the error with
>     every matlab job that people try to run so far. Does this error mean the
>     program will not run ok? Does disabling the DL_PLUGIN (whatever that is)
>     mean that checkpointing will not work?
>     Some guidance please....
>
> On 29-Jun-16 8:08 AM, Jonathan Patterson wrote:
> >
> > Great, thank you.
> > We're just using TCP over gigabit ethernet for the network.
> > Slurm is 15.08.6, but it's not doing the checkpointing. I'm doing
> > that manually. As far as slurm is concerned, there is no checkpointing.
> > I'm not starting MPI jobs with dmtcp_launch - I'm not aiming to
> > checkpoint the MPI jobs, it's the 1-core, low-memory simple jobs
> > that I want to checkpoint, so these can be moved around to make way
> > for the more complex jobs. So I'm thinking we can leave MPI out of this.
> > Most jobs are running with no problems, it's just the dmtcp ones
> > that *occasionally* have a problem.
> > Some of the failing jobs (specific ones) complain about libdl.so
> > (see below), but not all of them, if that helps. Maybe we should
> > deal with that issue first?
> > The other failing jobs fail simply with the message I posted before.
> >
> > [43000] WARNING at dlwrappers.cpp:75 in dlopen; REASON='JWARNING(ret) failed'
> >      filename = libirml.so.1
> >      flag = 1
> > Message: dlopen failed. You may also see a message 'ERROR: ld.so:'
> >     from libdl.so. If this happens only under DMTCP, then consider setting
> >     the environment variable DMTCP_DL_PLUGIN to "0" before 'dmtcp_launch'.
> >     If the problem persists, please write to the DMTCP developers.
> >
> > [43000] NOTE at processinfo.cpp:199 in growStack; REASON='bottom-most page of stack (page with highest address) was invisible in /proc/self/maps. It is made visible again now.'
> > [43000] WARNING at dlwrappers.cpp:75 in dlopen; REASON='JWARNING(ret) failed'
> >      filename = libcilkrts.so
> >      flag = 1
> > Message: dlopen failed. You may also see a message 'ERROR: ld.so:'
> >     from libdl.so. If this happens only under DMTCP, then consider setting
> >     the environment variable DMTCP_DL_PLUGIN to "0" before 'dmtcp_launch'.
> >     If the problem persists, please write to the DMTCP developers.
> >
> > [43000] ERROR at dmtcpmessagetypes.cpp:65 in assertValid; REASON='JASSERT(strcmp ( DMTCP_MAGIC_STRING,_magicBits ) == 0) failed'
> >      _magicBits =
> > Message: read invalid message, _magicBits mismatch. Did DMTCP coordinator die uncleanly?
> >
> > On 28/06/16 22:59, Jiajun Cao wrote:
> >> Hi Jonathan,
> >>
> >> Thanks for writing to us. We're definitely glad to help you with the
> >> problem. Can you provide us the following info:
> >>
> >> What's the interconnect of the cluster, InfiniBand, TCP?
> >>
> >> What versions of Slurm and MPI do you use?
> >>
> >> Aside from the failed jobs, are the remaining jobs successful? Can they
> >> checkpoint/restart successfully?
> >>
> >> The log you sent is very general: it tells only that the client cannot
> >> connect to the coordinator somehow. There can be various reasons for
> >> that. We'll need to dig further.
> >>
> >> Best,
> >> Jiajun
> >>
> >> On Tue, Jun 28, 2016 at 11:45 AM, Jonathan Patterson <j...@astro.ox.ac.uk> wrote:
> >>
> >>     Hello!
> >>     I'm running v. 2.4.4 on CentOS 6.8, kernel 2.6.32-431.20.3.el6.x86_64
> >>     This is a cluster, with ~ 100 compute nodes, running slurm.
> >>     Jobs are started with dmtcp_launch --rm. The idea is that
> >>     jobs can be checkpointed as needed, to move them around between
> >>     machines to fit jobs together to make room for high memory/specific
> >>     MPI geometry jobs. This has worked well, but...
> >>     Out of ~ 45,000 jobs that have run so far, ~ 100 have
> >>     errors as below. I cannot find a common compute node, time, job
> >>     type, user, memory usage, or any other factor - it seems that dmtcp
> >>     is just randomly generating this error. This stops the job, which is
> >>     a bit of a problem. No checkpointing was attempted on these jobs.
> >>     Any ideas where I should look for the problem, anybody?
> >>     Anything I can do to get some more debugging info? Is it the
> >>     coordinator, or the dmtcp library wrapped around the running program
> >>     that's generating this error?
> >>     Thanks in advance...
> >>
> >>     [47000] ERROR at dmtcpmessagetypes.cpp:65 in assertValid; REASON='JASSERT(strcmp ( DMTCP_MAGIC_STRING,_magicBits ) == 0) failed'
> >>          _magicBits =
> >>     Message: read invalid message, _magicBits mismatch. Did DMTCP coordinator die uncleanly?
> >>     main-PYTHIA8-lhef (47000): Terminating...
> >>     [40000] ERROR at coordinatorapi.cpp:601 in createNewConnectionBeforeFork; REASON='JASSERT(_coordinatorSocket.isValid()) failed'
> >>     bash (40000): Terminating...

_______________________________________________
Dmtcp-forum mailing list
Dmtcp-forum@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum