Hi Rob,

Could you please try the attached patch? Basically, it removes
the lock/unlock operations in the fork() wrapper.

Note that this is a temporary change to help us better understand
the bug. If you no longer see the hang, we'll think about a more
elegant fix.
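
For reference, here is a minimal, self-contained sketch of the deadlock
pattern we believe is happening (this is NOT DMTCP code: the rwlock, the
worker thread, and the atfork handler below are simplified stand-ins for
DMTCP's wrapper-execution lock, an OpenBLAS worker thread, and the
blas_thread_shutdown_ atfork handler seen in your backtrace). Built with
"gcc -pthread", it hangs inside fork() the same way your thread 1 does:

    /* deadlock_sketch.c -- illustrative only, not DMTCP source code */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static pthread_rwlock_t wrapper_lock = PTHREAD_RWLOCK_INITIALIZER;
    static pthread_t worker_tid;

    /* Stand-in for a BLAS worker thread: before it can finish, it must
     * take the shared lock, like the wrapped threads do under DMTCP. */
    static void *worker(void *arg) {
        (void)arg;
        sleep(1);                             /* let main() reach fork() first */
        pthread_rwlock_rdlock(&wrapper_lock); /* blocks: fork wrapper holds it */
        pthread_rwlock_unlock(&wrapper_lock);
        return NULL;
    }

    /* Stand-in for the pthread_atfork() prepare handler that OpenBLAS
     * registers, which joins its worker threads. */
    static void atfork_prepare(void) {
        pthread_join(worker_tid, NULL);       /* waits forever: worker is stuck */
    }

    int main(void) {
        pthread_atfork(atfork_prepare, NULL, NULL);
        pthread_create(&worker_tid, NULL, worker, NULL);

        /* Stand-in for the fork() wrapper: take the exclusive lock, then
         * call fork(), which runs atfork_prepare() in this same thread. */
        pthread_rwlock_wrlock(&wrapper_lock);
        pid_t pid = fork();                   /* never returns: deadlock */
        pthread_rwlock_unlock(&wrapper_lock);

        printf("fork() returned %d\n", (int)pid);
        return 0;
    }

Removing the exclusive lock/unlock around fork(), as the attached patch
does, breaks this cycle, but it also means a checkpoint is no longer
blocked while fork() runs; that is why it is only a temporary change.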

Best,
Jiajun

On Fri, Aug 04, 2017 at 03:56:05PM -0700, Rob Egan wrote:
> Hi Rohan,
> Great!  Thank you.  I got your email at 11:23 AM and this one at 3:15PM.
> I'll watch that bug and test the patch when it is available too.
> Best,
> Rob
> 
> On Fri, Aug 4, 2017 at 3:15 PM, Rohan Garg <rohg...@ccs.neu.edu> wrote:
> 
> > Hi Rob,
> >
> > For some reason, the DMTCP forum on SourceForge is acting weird;
> > it rejected my last message. So I'm not sure if you received my
> > last e-mail.
> >
> > We are looking into this actively. Jiajun has also opened an issue
> > on Github (see: https://github.com/dmtcp/dmtcp/issues/612), which
> > we believe is the reason for the bug that you are seeing. Hopefully, we
> > should have a fix soon.
> >
> > -Rohan
> >
> > On Fri, Aug 04, 2017 at 02:23:43PM -0400, Rohan Garg wrote:
> > > Hi Rob,
> > >
> > >
> > > I was discussing this with Jiajun, and it seems like we might be able to
> > > write a simple test program to simulate this bug. This will definitely
> > > require a fix in DMTCP. Hopefully, we can provide a patch to you soon.
> > >
> > > I'll try to work on it this weekend. We'll see if a screen sharing
> > > session is necessary.
> > >
> > > Thanks,
> > > Rohan
> > >
> > > On Tue, Aug 01, 2017 at 11:00:50PM -0700, Rob Egan wrote:
> > > > Thanks Jiajun & Rohan!
> > > >
> > > > So in that trace before, threads 2 through 41 are all identical, so
> > > > almost all the threads are in that nanosleep.  Since this is third-party
> > > > software, my ability to change the environment and code is quite limited,
> > > > and Python 3.4 is the only version that I was able to get working in this
> > > > environment.
> > > >
> > > > Jiajun's explanation sounds like the issue to me.  What would be the next
> > > > steps to address it?  Would we need to patch dmtcp? openblas?
> > > >
> > > > I'd be happy to arrange a screen sharing / zoom / hangouts session with
> > > > one or more of you.
> > > >
> > > > -Rob
> > > >
> > > > On Tue, Aug 1, 2017 at 9:37 AM, Jiajun Cao <jia...@ccs.neu.edu> wrote:
> > > >
> > > > > Hi Rob,
> > > > >
> > > > > I know what's going on here after browsing some blas
> > > > > source code. As Rohan mentioned, thread 42 is behaving normally.
> > > > > What looks more interesting is the calling stack of thread 1, the
> > > > > main thread, especially the bottom 5 frames. It seems that the blas
> > > > > library registers some handler at fork, using pthread_atfork():
> > > > >
> > > > > https://github.com/xianyi/OpenBLAS/blob/e70a6b92bf9f69f113e198e3ded558ed0675b038/driver/others/memory.c
> > > > >
> > > > > The handler tries to join a bunch of other blas threads, in this case,
> > > > > including thread 41.
> > > > >
> > > > > The problem is, DMTCP has its own internal synchronization semantics.
> > > > > When a thread exits, it tries to grab a lock to prevent a checkpoint
> > > > > from happening: it is an internal critical section. On the other hand,
> > > > > when the application calls fork(), it also grabs the lock before fork()
> > > > > returns: another internal critical section. That's why we see a deadlock
> > > > > here:
> > > > >
> > > > > thread 1: lock acquired, and waiting for thread 41 to finish
> > > > > thread 41: trying to acquire the lock before it exits
> > > > >
> > > > > Best,
> > > > > Jiajun
> > > > >
> > > > > On Mon, Jul 31, 2017 at 10:15:15PM -0700, Rohan Garg wrote:
> > > > > > Hi Rob,
> > > > > >
> > > > > > Thread 42 is DMTCP's checkpoint thread. DMTCP injects that in every
> > > > > > user process that's launched using DMTCP. It seems to be simply
> > > > > > waiting for a message from the coordinator, which is the normal
> > > > > > state of things. It's not stuck or the cause of the deadlock here.
> > > > > >
> > > > > > The deadlock is definitely because of one or more user threads. It
> > > > > > seems like the program also forks and execs some child processes.
> > > > > > We have seen some corner cases around that but it's difficult to
> > > > > > say what it is with the information below.
> > > > > >
> > > > > > The fact that even `--disable-all-plugins` didn't work is quite
> > > > > > interesting.
> > > > > >
> > > > > > I have a few other suggestions:
> > > > > >
> > > > > >  - Is it possible to share perhaps the top 20-25 frames from each of
> > > > > >    the threads? (You can also send them to us/me directly at:
> > > > > >    dm...@ccs.neu.edu/rohg...@ccs.neu.edu)
> > > > > >
> > > > > >    Also, after you save the backtraces from all the threads, detach
> > > > > >    from gdb, wait for a few seconds, re-attach, and save a second
> > > > > >    set of backtraces. It'll help us figure out which threads are
> > > > > >    really stuck.
> > > > > >
> > > > > >  - Is it possible to run with other Python versions (not 3.4) as a
> > > > > >    quick test?
> > > > > >
> > > > > >  - What Numpy/OpenBLAS versions are you using? I'll try to run
> > > > > >    a few experiments locally.
> > > > > >
> > > > > >  - Is it possible to try to cut down your Python script to the bare
> > > > > >    minimum, to isolate the issue? Perhaps start with just a few imports
> > > > > >    and then add code incrementally. But if it turns out to be more
> > > > > >    work than it's worth, then we should skip this suggestion.
> > > > > >
> > > > > >  - Finally, if it's possible, we could arrange for a screen sharing
> > > > > >    session. I think it'll be more efficient for interactively
> > > > > >    debugging this.
> > > > > >
> > > > > > -Rohan
> > > > > >
> > > > > > On Mon, Jul 31, 2017 at 03:30:55PM -0700, Rob Egan wrote:
> > > > > > > Hi Rohan,
> > > > > > >
> > > > > > > Thanks for the quick reply.  So using version 2.4.8 and the
> > > > > > > --disable-alloc-plugin and --disable-dl-plugin options, the behavior
> > > > > > > is the same.  I also tried --disable-all-plugins and it still did not
> > > > > > > work, so I think this may be different from the numpy bug you
> > > > > > > identified, though as you will see, the stack trace does show one of
> > > > > > > the threads in the numpy / openblas library.
> > > > > > >
> > > > > > > Though we are running at NERSC, I am not running this on the cray
> > > > > > > systems (yet) and not through slurm (yet).
> > > > > > >
> > > > > > > When I attach to the running process, unfortunately there are a ton
> > > > > > > of threads and a deep python stack.  Apparently even though I only
> > > > > > > requested 1 thread, the python library allocates a ton of them.  I
> > > > > > > tried to remove duplicates manually for you.  It looks like threads 42
> > > > > > > and 1 are the interesting ones.  2-41 are all the same.  It looks to
> > > > > > > me like dmtcp_coordinator is interfering with the blas library's
> > > > > > > pthread_join calls.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Thread 42 (Thread 0x7f266738e700 (LWP 14136)):
> > > > > > > #0  0x00007f26690ed14d in read () from /lib/libpthread.so.0
> > > > > > > #1  0x00007f2669804224 in _real_read (fd=<optimized out>,
> > > > > buf=<optimized
> > > > > > > out>, count=<optimized out>) at syscallsreal.c:504
> > > > > > > #2  0x00007f26697eae80 in dmtcp::Util::readAll (fd=821,
> > > > > buf=0x7f266738bc00,
> > > > > > > count=136) at util_misc.cpp:279
> > > > > > > #3  0x00007f26697c303b in operator>><dmtcp::DmtcpMessage>
> > (t=...,
> > > > > > > this=0x7f2669c33408) at ../jalib/jsocket.h:109
> > > > > > > #4  dmtcp::CoordinatorAPI::recvMsgFromCoordinator
> > > > > (this=0x7f2669c33408,
> > > > > > > msg=0x7f266738bc00, extraData=0x0) at coordinatorapi.cpp:422
> > > > > > > #5  0x00007f26697bb9fc in dmtcp::DmtcpWorker::
> > waitForCoordinatorMsg
> > > > > > > (msgStr=..., type=dmtcp::DMT_DO_SUSPEND) at dmtcpworker.cpp:469
> > > > > > > #6  0x00007f26697bc58b in dmtcp::DmtcpWorker::waitForStage1Suspend
> > ()
> > > > > at
> > > > > > > dmtcpworker.cpp:513
> > > > > > > #7  0x00007f26697ccd1a in dmtcp::callbackSleepBetweenCheckpoint
> > > > > > > (sec=<optimized out>) at mtcpinterface.cpp:61
> > > > > > > #8  0x00007f26697d85e9 in checkpointhread (dummy=<optimized
> > out>) at
> > > > > > > threadlist.cpp:359
> > > > > > > #9  0x00007f26697cfb9e in pthread_start (arg=<optimized out>) at
> > > > > > > threadwrappers.cpp:159
> > > > > > > #10 0x00007f26690e58ca in start_thread () from
> > /lib/libpthread.so.0
> > > > > > > #11 0x00007f26697cf9c9 in clone_start (arg=0x7f2669c15008) at
> > > > > > > threadwrappers.cpp:68
> > > > > > > #12 0x00007f26687c3abd in clone () from /lib/libc.so.6
> > > > > > > #13 0x0000000000000000 in ?? ()
> > > > > > >
> > > > > > > Thread 41 (Thread 0x7f2660f99700 (LWP 14137)):
> > > > > > > #0  0x00007f26690ed86d in nanosleep () from /lib/libpthread.so.0
> > > > > > > #1  0x00007f26697bf599 in dmtcp::ThreadSync::
> > wrapperExecutionLockLock
> > > > > () at
> > > > > > > threadsync.cpp:392
> > > > > > > #2  0x00007f26697cfba6 in pthread_start (arg=<optimized out>) at
> > > > > > > threadwrappers.cpp:161
> > > > > > > #3  0x00007f26690e58ca in start_thread () from
> > /lib/libpthread.so.0
> > > > > > > #4  0x00007f26697cf9c9 in clone_start (arg=0x7f2669c24008) at
> > > > > > > threadwrappers.cpp:68
> > > > > > > #5  0x00007f26687c3abd in clone () from /lib/libc.so.6
> > > > > > > #6  0x0000000000000000 in ?? ()
> > > > > > >
> > > > > > > Thread 40 (Thread 0x7f2660798700 (LWP 14138)):
> > > > > > > ...
> > > > > > >
> > > > > > > Thread 1 (Thread 0x7f2669c26740 (LWP 14131)):
> > > > > > > #0  0x00007f26690ece69 in __lll_timedwait_tid () from
> > > > > /lib/libpthread.so.0
> > > > > > > #1  0x00007f26690e6ebf in pthread_timedjoin_np () from
> > > > > /lib/libpthread.so.0
> > > > > > > #2  0x00007f26697d0048 in pthread_join (thread=139802812454656,
> > > > > retval=0x0)
> > > > > > > at threadwrappers.cpp:274
> > > > > > > #3  0x00007f26615cd2d7 in blas_thread_shutdown_ () from
> > > > > > > /global/homes/r/regan/.local/lib/python3.4/site-packages/
> > > > > numpy/core/../.libs/
> > > > > > > libopenblasp-r0-39a31c03.2.18.so
> > > > > > > #4  0x00007f26687936d5 in fork () from /lib/libc.so.6
> > > > > > > #5  0x00007f26697c9724 in fork () at execwrappers.cpp:186
> > > > > > > #6  0x00007f2665b9ed7f in subprocess_fork_exec (self=<optimized
> > out>,
> > > > > > > args=<optimized out>) at
> > > > > > > /global/common/genepool/usg/languages/python/build/Python-
> > > > > 3.4.3/Modules/_posixsubprocess.c:649
> > > > > > > #7  0x00007f2669427225 in call_function (oparg=<optimized out>,
> > > > > > > pp_stack=0x7fff86d8d690) at Python/c
> > > > > > > eval.c:4237
> > > > > > > #8  PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized
> > out>)
> > > > > at
> > > > > > > Python/ceval.c:2838
> > > > > > > #9  0x00007f2669428cec in PyEval_EvalCodeEx (_co=<optimized out>,
> > > > > > > globals=<optimized out>, locals=<optimized out>, args=<optimized
> > out>,
> > > > > > > argcount=19, kws=0x7f25f4014ad8, kwcount=0, defs=0x0,
> > > > > > >     defcount=0, kwdefs=0x0, closure=0x0) at Python/ceval.c:3588
> > > > > > > #10 0x00007f2669427ca4 in fast_function (nk=<optimized out>,
> > > > > na=<optimized
> > > > > > > out>, n=<optimized out>, pp_stack=0x7fff86d8d880,
> > func=0x7f2669a60ea0)
> > > > > at
> > > > > > > Python/ceval.c:4344
> > > > > > > #11 call_function (oparg=<optimized out>,
> > pp_stack=0x7fff86d8d880) at
> > > > > > > Python/ceval.c:4262
> > > > > > > #12 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized
> > out>)
> > > > > at
> > > > > > > Python/ceval.c:2838
> > > > > > > #13 0x00007f2669428cec in PyEval_EvalCodeEx (_co=<optimized out>,
> > > > > > > globals=<optimized out>, locals=<optimized out>, args=<optimized
> > out>,
> > > > > > > argcount=2, kws=0x7f25fa0dda70, kwcount=4,
> > > > > > >     defs=0x7f2669a55110, defcount=16, kwdefs=0x0, closure=0x0) at
> > > > > > > Python/ceval.c:3588
> > > > > > > #14 0x00007f266938a2bc in function_call (func=0x7f2669a60840,
> > > > > > > arg=0x7f25f9f2c148, kw=0x7f25f9f46208) at
> > Objects/funcobject.c:632
> > > > > > > #15 0x00007f266935e8be in PyObject_Call (func=0x7f2669a60840,
> > > > > > > arg=<optimized out>, kw=<optimized out>) at
> > Objects/abstract.c:2040
> > > > > > > #16 0x00007f2669375354 in method_call (func=0x7f2669a60840,
> > > > > > > arg=0x7f25f9f2c148, kw=0x7f25f9f46208) at
> > Objects/classobject.c:347
> > > > > > > #17 0x00007f266935e8be in PyObject_Call (func=0x7f2663ff4148,
> > > > > > > arg=<optimized out>, kw=<optimized out>) at
> > Objects/abstract.c:2040
> > > > > > > #18 0x00007f26693c15f7 in slot_tp_init (self=0x7f25f9f45198,
> > > > > > > args=0x7f25f9f45128, kwds=0x7f25f9f46208) at
> > Objects/typeobject.c:6177
> > > > > > > #19 0x00007f26693bd17f in type_call (type=<optimized out>,
> > > > > > > args=0x7f25f9f45128, kwds=0x7f25f9f46208) at
> > Objects/typeobject.c:898
> > > > > > > #20 0x00007f266935e8be in PyObject_Call (func=0x1b8c0a8,
> > arg=<optimized
> > > > > > > out>, kw=<optimized out>) at Objects/abstract.c:2040
> > > > > > > #21 0x00007f2669425a06 in do_call (nk=<optimized out>,
> > na=<optimized
> > > > > out>,
> > > > > > > pp_stack=0x7fff86d8dc80, func=0x1b8c0a8) at Python/ceval.c:4466
> > > > > > > #22 call_function (oparg=<optimized out>,
> > pp_stack=0x7fff86d8dc80) at
> > > > > > > Python/ceval.c:4264
> > > > > > > #23 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized
> > out>)
> > > > > at
> > > > > > > Python/ceval.c:2838
> > > > > > > #24 0x00007f2669427d68 in fast_function (nk=<optimized out>,
> > > > > na=<optimized
> > > > > > > out>, n=1, pp_stack=0x7fff86d8ddc0, func=0x7f25f9e96bf8) at
> > > > > > > Python/ceval.c:4334
> > > > > > > #25 call_function (oparg=<optimized out>,
> > pp_stack=0x7fff86d8ddc0) at
> > > > > > > Python/ceval.c:4262
> > > > > > > #26 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized
> > out>)
> > > > > at
> > > > > > > Python/ceval.c:2838
> > > > > > > #27 0x00007f2669427d68 in fast_function (nk=<optimized out>,
> > > > > na=<optimized
> > > > > > > out>, n=1, pp_stack=0x7fff86d8df00, func=0x7f25f9e96c80) at
> > > > > > > Python/ceval.c:4334
> > > > > > > #28 call_function (oparg=<optimized out>,
> > pp_stack=0x7fff86d8df00) at
> > > > > > > Python/ceval.c:4262
> > > > > > > #29 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized
> > out>)
> > > > > at
> > > > > > > Python/ceval.c:2838
> > > > > > > #30 0x00007f2669428cec in PyEval_EvalCodeEx (_co=<optimized out>,
> > > > > > > globals=<optimized out>, locals=<optimized out>, args=<optimized
> > out>,
> > > > > > > argcount=0, kws=0x0, kwcount=0, defs=0x0, defcount=0,
> > > > > > >     kwdefs=0x0, closure=0x0) at Python/ceval.c:3588
> > > > > > > #31 0x00007f2669428dbb in PyEval_EvalCode (co=<optimized out>,
> > > > > > > globals=<optimized out>, locals=<optimized out>) at
> > Python/ceval.c:775
> > > > > > > #32 0x00007f266941d3d8 in builtin_exec (self=<optimized out>,
> > > > > > > args=<optimized out>) at Python/bltinmodule.c:882
> > > > > > > #33 0x00007f2669426247 in ext_do_call (nk=0, na=0,
> > flags=<optimized
> > > > > out>,
> > > > > > > pp_stack=0x7fff86d8e1f0, func=0x7f2669c05788) at
> > Python/ceval.c:4558
> > > > > > > #34 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized
> > out>)
> > > > > at
> > > > > > > Python/ceval.c:2878
> > > > > > > #35 0x00007f2669428cec in PyEval_EvalCodeEx (_co=<optimized out>,
> > > > > > > globals=<optimized out>, locals=<optimized out>, args=<optimized
> > out>,
> > > > > > > argcount=3, kws=0x7f2663d13938, kwcount=0, defs=0x0,
> > > > > > >     defcount=0, kwdefs=0x0, closure=0x0) at Python/ceval.c:3588
> > > > > > > #36 0x00007f2669427ca4 in fast_function (nk=<optimized out>,
> > > > > na=<optimized
> > > > > > > out>, n=<optimized out>, pp_stack=0x7fff86d8e3e0,
> > func=0x7f2669bd3b70)
> > > > > at
> > > > > > > Python/ceval.c:4344
> > > > > > > #37 call_function (oparg=<optimized out>,
> > pp_stack=0x7fff86d8e3e0) at
> > > > > > > Python/ceval.c:4262
> > > > > > > #38 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized
> > out>)
> > > > > at
> > > > > > > Python/ceval.c:2838
> > > > > > > #39 0x00007f2669427d68 in fast_function (nk=<optimized out>,
> > > > > na=<optimized
> > > > > > > out>, n=2, pp_stack=0x7fff86d8e520, func=0x7f2669b5d158) at
> > > > > > > Python/ceval.c:4334
> > > > > > > #40 call_function (oparg=<optimized out>,
> > pp_stack=0x7fff86d8e520) at
> > > > > > > Python/ceval.c:4262
> > > > > > > #41 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized
> > out>)
> > > > > at
> > > > > > > Python/ceval.c:2838
> > > > > > > #42 0x00007f2669427d68 in fast_function (nk=<optimized out>,
> > > > > na=<optimized
> > > > > > > out>, n=2, pp_stack=0x7fff86d8e660, func=0x7f2669b56ea0) at
> > > > > > > Python/ceval.c:4334
> > > > > > > #43 call_function (oparg=<optimized out>,
> > pp_stack=0x7fff86d8e660) at
> > > > > > > Python/ceval.c:4262
> > > > > > > #44 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized
> > out>)
> > > > > at
> > > > > > > Python/ceval.c:2838
> > > > > > > #45 0x00007f2669427d68 in fast_function (nk=<optimized out>,
> > > > > na=<optimized
> > > > > > > out>, n=1, pp_stack=0x7fff86d8e7a0, func=0x7f2669b590d0) at
> > > > > > > Python/ceval.c:4334
> > > > > > > #46 call_function (oparg=<optimized out>,
> > pp_stack=0x7fff86d8e7a0) at
> > > > > > > Python/ceval.c:4262
> > > > > > > #47 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized
> > out>)
> > > > > at
> > > > > > > Python/ceval.c:2838
> > > > > > > #48 0x00007f2669427d68 in fast_function (nk=<optimized out>,
> > > > > na=<optimized
> > > > > > > out>, n=2, pp_stack=0x7fff86d8e8e0, func=0x7f2669b61620) at
> > > > > > > Python/ceval.c:4334
> > > > > > > #49 call_function (oparg=<optimized out>,
> > pp_stack=0x7fff86d8e8e0) at
> > > > > > > Python/ceval.c:4262
> > > > > > > #50 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized
> > out>)
> > > > > at
> > > > > > > Python/ceval.c:2838
> > > > > > > #51 0x00007f2669428cec in PyEval_EvalCodeEx (_co=<optimized out>,
> > > > > > > globals=<optimized out>, locals=<optimized out>, args=<optimized
> > out>,
> > > > > > > argcount=2, kws=0x0, kwcount=0, defs=0x0, defcount=0,
> > > > > > >     kwdefs=0x0, closure=0x0) at Python/ceval.c:3588
> > > > > > > #52 0x00007f266938a1b5 in function_call (func=0x7f2669b616a8,
> > > > > > > arg=0x7f25f9f2cf48, kw=0x0) at Objects/funcobject.c:632
> > > > > > > #53 0x00007f266935e8be in PyObject_Call (func=0x7f2669b616a8,
> > > > > > > arg=<optimized out>, kw=<optimized out>) at
> > Objects/abstract.c:2040
> > > > > > > #54 0x00007f266935f2cd in _PyObject_CallMethodIdObjArgs
> > > > > > > (callable=0x7f2669b616a8, name=<optimized out>) at
> > > > > Objects/abstract.c:2310
> > > > > > > #55 0x00007f26694427bc in PyImport_ImportModuleLevelObject
> > > > > > > (name=0x7f25fb28aa78, given_globals=<optimized out>,
> > locals=<optimized
> > > > > > > out>, given_fromlist=<optimized out>, level=0) at
> > Python/import.c:1636
> > > > > > > #56 0x00007f266941dcaf in builtin___import__ (self=<optimized
> > out>,
> > > > > > > args=<optimized out>, kwds=<optimized out>) at
> > Python/bltinmodule.c:210
> > > > > > > #57 0x00007f266935e8be in PyObject_Call (func=0x7f2669c053c8,
> > > > > > > arg=<optimized out>, kw=<optimized out>) at
> > Objects/abstract.c:2040
> > > > > > > #58 0x00007f2669420227 in PyEval_CallObjectWithKeywords
> > > > > > > (func=0x7f2669c053c8, arg=0x7f25fa0dfc50, kw=<optimized out>) at
> > > > > > > Python/ceval.c:4114
> > > > > > > #59 0x00007f2669421907 in PyEval_EvalFrameEx (f=<optimized out>,
> > > > > > > throwflag=<optimized out>) at Python/ceval.c:2478
> > > > > > > #60 0x00007f2669428cec in PyEval_EvalCodeEx (_co=<optimized out>,
> > > > > > > globals=<optimized out>, locals=<optimized out>, args=<optimized
> > out>,
> > > > > > > argcount=2, kws=0x7f25f9f3e3d0, kwcount=2,
> > > > > > >     defs=0x7f25fa5ed360, defcount=2, kwdefs=0x0, closure=0x0) at
> > > > > > > Python/ceval.c:3588
> > > > > > > #61 0x00007f2669427ca4 in fast_function (nk=<optimized out>,
> > > > > na=<optimized
> > > > > > > out>, n=<optimized out>, pp_stack=0x7fff86d8eff0,
> > func=0x7f25fa5efe18)
> > > > > at
> > > > > > > Python/ceval.c:4344
> > > > > > > #62 call_function (oparg=<optimized out>,
> > pp_stack=0x7fff86d8eff0) at
> > > > > > > Python/ceval.c:4262
> > > > > > > #63 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized
> > out>)
> > > > > at
> > > > > > > Python/ceval.c:2838
> > > > > > > #64 0x00007f2669428cec in PyEval_EvalCodeEx (_co=<optimized out>,
> > > > > > > globals=<optimized out>, locals=<optimized out>, args=<optimized
> > out>,
> > > > > > > argcount=3, kws=0x7f2669bd5060, kwcount=0, defs=0x0,
> > > > > > >     defcount=0, kwdefs=0x0, closure=0x0) at Python/ceval.c:3588
> > > > > > > #65 0x00007f266938a2bc in function_call (func=0x7f25fa5efd08,
> > > > > > > arg=0x7f25fa0f4ab0, kw=0x7f25f9f2cdc8) at
> > Objects/funcobject.c:632
> > > > > > > #66 0x00007f25fc7ae500 in __Pyx_PyObject_Call (kw=0x7f25f9f2cdc8,
> > > > > > > arg=<optimized out>, func=<optimized out>) at
> > > > > > > /global/scratch2/sd/gpinterp/pip-build-5fy4tlgm/h5py/h5py/_
> > > > > objects.c:7420
> > > > > > > #67 __pyx_pf_4h5py_8_objects_9with_phil_wrapper
> > > > > > > (__pyx_v_kwds=0x7f25f9f2cdc8, __pyx_v_args=0x7f25fa0f4ab0,
> > > > > > > __pyx_self=<optimized out>)
> > > > > > >     at
> > > > > > > /global/scratch2/sd/gpinterp/pip-build-5fy4tlgm/h5py/h5py/_
> > > > > objects.c:2429
> > > > > > > #68 __pyx_pw_4h5py_8_objects_9with_phil_1wrapper
> > > > > (__pyx_self=<optimized
> > > > > > > out>, __pyx_args=0x7f25fa0f4ab0, __pyx_kwds=<optimized out>)
> > > > > > >     at
> > > > > > > /global/scratch2/sd/gpinterp/pip-build-5fy4tlgm/h5py/h5py/_
> > > > > objects.c:2345
> > > > > > > #69 0x00007f266935e8be in PyObject_Call (func=0x7f25fa5ec498,
> > > > > > > arg=<optimized out>, kw=<optimized out>) at
> > Objects/abstract.c:2040
> > > > > > > #70 0x00007f2669375354 in method_call (func=0x7f25fa5ec498,
> > > > > > > arg=0x7f25fa0f4ab0, kw=0x0) at Objects/classobject.c:347
> > > > > > > #71 0x00007f266935e8be in PyObject_Call (func=0x7f2663d3a048,
> > > > > > > arg=<optimized out>, kw=<optimized out>) at
> > Objects/abstract.c:2040
> > > > > > > #72 0x00007f26693be3ec in call_method (o=<optimized out>,
> > > > > > > nameid=0x7f266974d0d0 <PyId___setitem__>, format=0x7f26694e7609
> > > > > "(OO)") at
> > > > > > > Objects/typeobject.c:1429
> > > > > > > #73 0x00007f26693be541 in slot_mp_ass_subscript (self=<optimized
> > out>,
> > > > > > > key=<optimized out>, value=<optimized out>) at
> > > > > Objects/typeobject.c:5751
> > > > > > > #74 0x00007f2669424d0d in PyEval_EvalFrameEx (f=<optimized out>,
> > > > > > > throwflag=<optimized out>) at Python/ceval.c:1831
> > > > > > > #75 0x00007f2669428cec in PyEval_EvalCodeEx (_co=<optimized out>,
> > > > > > > globals=<optimized out>, locals=<optimized out>, args=<optimized
> > out>,
> > > > > > > argcount=3, kws=0x7f25f9f27e68, kwcount=0,
> > > > > > >     defs=0x7f25fb259300, defcount=1, kwdefs=0x0, closure=0x0) at
> > > > > > > Python/ceval.c:3588
> > > > > > > #76 0x00007f2669427ca4 in fast_function (nk=<optimized out>,
> > > > > na=<optimized
> > > > > > > out>, n=<optimized out>, pp_stack=0x7fff86d8f700,
> > func=0x7f25fa0db730)
> > > > > at
> > > > > > > Python/ceval.c:4344
> > > > > > > #77 call_function (oparg=<optimized out>,
> > pp_stack=0x7fff86d8f700) at
> > > > > > > Python/ceval.c:4262
> > > > > > > #78 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized
> > out>)
> > > > > at
> > > > > > > Python/ceval.c:2838
> > > > > > > #79 0x00007f2669427d68 in fast_function (nk=<optimized out>,
> > > > > na=<optimized
> > > > > > > out>, n=3, pp_stack=0x7fff86d8f840, func=0x7f25fa0db6a8) at
> > > > > > > Python/ceval.c:4334
> > > > > > > #80 call_function (oparg=<optimized out>,
> > pp_stack=0x7fff86d8f840) at
> > > > > > > Python/ceval.c:4262
> > > > > > > #81 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized
> > out>)
> > > > > at
> > > > > > > Python/ceval.c:2838
> > > > > > > #82 0x00007f2669428cec in PyEval_EvalCodeEx (_co=<optimized out>,
> > > > > > > globals=<optimized out>, locals=<optimized out>, args=<optimized
> > out>,
> > > > > > > argcount=5, kws=0x7f25f424f280, kwcount=0,
> > > > > > >     defs=0x7f25fe2d1f08, defcount=1, kwdefs=0x0, closure=0x0) at
> > > > > > > Python/ceval.c:3588
> > > > > > > #83 0x00007f2669427ca4 in fast_function (nk=<optimized out>,
> > > > > na=<optimized
> > > > > > > out>, n=<optimized out>, pp_stack=0x7fff86d8fa30,
> > func=0x7f25fa0d8e18)
> > > > > at
> > > > > > > Python/ceval.c:4344
> > > > > > > #84 call_function (oparg=<optimized out>,
> > pp_stack=0x7fff86d8fa30) at
> > > > > > > Python/ceval.c:4262
> > > > > > > #85 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized
> > out>)
> > > > > at
> > > > > > > Python/ceval.c:2838
> > > > > > > #86 0x00007f2669427d68 in fast_function (nk=<optimized out>,
> > > > > na=<optimized
> > > > > > > out>, n=3, pp_stack=0x7fff86d8fb70, func=0x7f25fa0e1620) at
> > > > > > > Python/ceval.c:4334
> > > > > > > #87 call_function (oparg=<optimized out>,
> > pp_stack=0x7fff86d8fb70) at
> > > > > > > Python/ceval.c:4262
> > > > > > > #88 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized
> > out>)
> > > > > at
> > > > > > > Python/ceval.c:2838
> > > > > > > #89 0x00007f2669427d68 in fast_function (nk=<optimized out>,
> > > > > na=<optimized
> > > > > > > out>, n=3, pp_stack=0x7fff86d8fcb0, func=0x7f25fa0fab70) at
> > > > > > > Python/ceval.c:4334
> > > > > > > #90 call_function (oparg=<optimized out>,
> > pp_stack=0x7fff86d8fcb0) at
> > > > > > > Python/ceval.c:4262
> > > > > > > #91 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized
> > out>)
> > > > > at
> > > > > > > Python/ceval.c:2838
> > > > > > > #92 0x00007f2669427d68 in fast_function (nk=<optimized out>,
> > > > > na=<optimized
> > > > > > > out>, n=3, pp_stack=0x7fff86d8fdf0, func=0x7f25fe2c16a8) at
> > > > > > > Python/ceval.c:4334
> > > > > > > #93 call_function (oparg=<optimized out>,
> > pp_stack=0x7fff86d8fdf0) at
> > > > > > > Python/ceval.c:4262
> > > > > > > #94 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized
> > out>)
> > > > > at
> > > > > > > Python/ceval.c:2838
> > > > > > > #95 0x00007f2669427d68 in fast_function (nk=<optimized out>,
> > > > > na=<optimized
> > > > > > > out>, n=1, pp_stack=0x7fff86d8ff30, func=0x7f25fe2c1620) at
> > > > > > > Python/ceval.c:4334
> > > > > > > #96 call_function (oparg=<optimized out>,
> > pp_stack=0x7fff86d8ff30) at
> > > > > > > Python/ceval.c:4262
> > > > > > > #97 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized
> > out>)
> > > > > at
> > > > > > > Python/ceval.c:2838
> > > > > > > #98 0x00007f2669427d68 in fast_function (nk=<optimized out>,
> > > > > na=<optimized
> > > > > > > out>, n=4, pp_stack=0x7fff86d90070, func=0x7f25fa0fc048) at
> > > > > > > Python/ceval.c:4334
> > > > > > > #99 call_function (oparg=<optimized out>,
> > pp_stack=0x7fff86d90070) at
> > > > > > > Python/ceval.c:4262
> > > > > > > #100 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized
> > out>)
> > > > > at
> > > > > > > Python/ceval.c:2838
> > > > > > > #101 0x00007f2669427d68 in fast_function (nk=<optimized out>,
> > > > > na=<optimized
> > > > > > > out>, n=1, pp_stack=0x7fff86d901b0, func=0x7f25fe2c19d8) at
> > > > > > > Python/ceval.c:4334
> > > > > > > #102 call_function (oparg=<optimized out>,
> > pp_stack=0x7fff86d901b0) at
> > > > > > > Python/ceval.c:4262
> > > > > > > #103 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized
> > out>)
> > > > > at
> > > > > > > Python/ceval.c:2838
> > > > > > > #104 0x00007f2669428cec in PyEval_EvalCodeEx (_co=<optimized
> > out>,
> > > > > > > globals=<optimized out>, locals=<optimized out>, args=<optimized
> > out>,
> > > > > > > argcount=0, kws=0x0, kwcount=0, defs=0x0, defcount=0,
> > > > > > >     kwdefs=0x0, closure=0x0) at Python/ceval.c:3588
> > > > > > > #105 0x00007f2669428dbb in PyEval_EvalCode (co=<optimized out>,
> > > > > > > globals=<optimized out>, locals=<optimized out>) at
> > Python/ceval.c:775
> > > > > > > #106 0x00007f266944f770 in run_mod (arena=0x1a62b70,
> > > > > flags=0x7fff86d904e0,
> > > > > > > locals=0x7f2669b7a448, globals=0x7f2669b7a448,
> > filename=0x7f2669ab7b20,
> > > > > > > mod=<optimized out>) at Python/pythonrun.c:2180
> > > > > > > #107 PyRun_FileExFlags (fp=0x1ab3c20, filename_str=<optimized
> > out>,
> > > > > > > start=<optimized out>, globals=0x7f2669b7a448,
> > locals=0x7f2669b7a448,
> > > > > > > closeit=1, flags=0x7fff86d904e0) at Python/pythonrun.c:2133
> > > > > > > #108 0x00007f26694504b9 in PyRun_SimpleFileExFlags (fp=0x1ab3c20,
> > > > > > > filename=<optimized out>, closeit=1, flags=0x7fff86d904e0) at
> > > > > > > Python/pythonrun.c:1606
> > > > > > > #109 0x00007f2669468231 in run_file (p_cf=0x7fff86d904e0,
> > > > > > > filename=0x19f5830
> > > > > > > L"/global/homes/r/regan/.local/bin/read_fast5_basecaller.py",
> > > > > fp=0x1ab3c20)
> > > > > > > at Modules/main.c:319
> > > > > > > #110 Py_Main (argc=<optimized out>, argv=<optimized out>) at
> > > > > > > Modules/main.c:751
> > > > > > > #111 0x0000000000400ae3 in main (argc=14, argv=0x7fff86d90658) at
> > > > > > > ./Modules/python.c:69
> > > > > > >
> > > > > > >
> > > > > > > On Sat, Jul 29, 2017 at 7:21 AM, Rohan Garg <rohg...@ccs.neu.edu> wrote:
> > > > > > >
> > > > > > > > I just realized that there's an open issue related to Python and
> > > > > > > > Numpy.  See here: https://github.com/dmtcp/dmtcp/issues/599.
> > > > > > > > Unfortunately, I haven't been able to spend enough time to get a
> > > > > > > > fix out.
> > > > > > > >
> > > > > > > > Could you please verify if this isn't what's going on? Even if it's
> > > > > > > > not the exact same libraries, I think the bug, as reported on the
> > > > > > > > Github issue, is more general and could manifest through different
> > > > > > > > libraries, although with a similar backtrace signature.
> > > > > > > >
> > > > > > > > On Fri, Jul 28, 2017 at 09:53:40PM -0400, Rohan Garg wrote:
> > > > > > > > > Hi Rob,
> > > > > > > > >
> > > > > > > > > Thanks for trying out DMTCP and reporting this issue. We'd be
> > > > > > > > > happy to help resolve this issue.
> > > > > > > > >
> > > > > > > > > I'll take a look at the strace, but probably the most useful
> > > > > > > > > thing here would be to try to figure out the process that's
> > > > > > > > > hanging and to look at its backtrace. The backtrace from the
> > > > > > > > > parent could also be useful. You should be able to attach gdb
> > > > > > > > > to the interesting processes and look at the backtrace.
> > > > > > > > >
> > > > > > > > >     $ gdb attach <pid>
> > > > > > > > >     ...
> > > > > > > > >     (gdb) thread apply all bt
> > > > > > > > >
> > > > > > > > > This can hopefully tell us the reason for the deadlock. Also, as
> > > > > > > > > a quick test, could you please try running your program with the
> > > > > > > > > following DMTCP options?
> > > > > > > > >
> > > > > > > > >     $ dmtcp_launch --disable-alloc-plugin --disable-dl-plugin <program-and-program-arguments>
> > > > > > > > >
> > > > > > > > > Also, I'd recommend working with the 2.4.x/2.5.x versions; we are
> > > > > > > > > not maintaining the 1.x branch any longer.
> > > > > > > > >
> > > > > > > > > (I do have a NERSC account and have recently diagnosed some
> > > > > > > > > issues with the CRAY toolchain/srun and DMTCP there. But it
> > > > > > > > > seems like that's not an issue here.)
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Rohan
> > > > > > > > >
> > > > > > > > > On Fri, Jul 28, 2017 at 06:28:03PM -0700, Rob Egan wrote:
> > > > > > > > > > Hello,
> > > > > > > > > >
> > > > > > > > > > I am trying to adopt dmtcp at our institution and ran into a
> > > > > > > > > > problem with the first script that I tried.  The script seems
> > > > > > > > > > to start fine and perform a few operations, but then stalls
> > > > > > > > > > indefinitely, consuming 0 cpu in what looks like a futex wait
> > > > > > > > > > loop after the process was cloned.  It is a multi-threaded
> > > > > > > > > > program and the behavior is the same whether 1 or more threads
> > > > > > > > > > are requested.
> > > > > > > > > >
> > > > > > > > > > This is a 3rd-party python script, so unfortunately I can't
> > > > > > > > > > share the code, but I uploaded the strace from a 1-threaded
> > > > > > > > > > execution:
> > > > > > > > > >
> > > > > > > > > > dmtcp-stall-read_fast5_basecaller.py-strace.log.gz
> > > > > > > > > > http://portal.nersc.gov/dna/RD/Adv-Seq/ONT/dmtcp-stall-read_fast5_basecaller.py-strace.log.gz
> > > > > > > > > >
> > > > > > > > > > This is how I executed dmtcp & strace & the script:
> > > > > > > > > >
> > > > > > > > > > dmtcp_launch  strace ~/.local/bin/read_fast5_basecaller.py -i uploaded/0/ -o fastq,fast5 -s basecalls6 -t 1 --flowcell FLO-MIN107 --kit SQK-LSK308
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > And here is the output from the dmtcp_coordinator (which I
> > > > > > > > > > commanded to kill after a few minutes of inactivity):
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > regan@gpint200:/global/projectb/scratch/regan/nanopore/runs/X0124-onebatch2$ dmtcp_coordinator
> > > > > > > > > > dmtcp_coordinator starting...
> > > > > > > > > >     Host: gpint200 (127.0.0.2)
> > > > > > > > > >     Port: 7779
> > > > > > > > > >     Checkpoint Interval: 18000
> > > > > > > > > >     Exit on last client: 0
> > > > > > > > > > Type '?' for help.
> > > > > > > > > >
> > > > > > > > > > [10273] NOTE at dmtcp_coordinator.cpp:1675 in
> > > > > updateCheckpointInterval;
> > > > > > > > > > REASON='CheckpointInterval updated (for this computation
> > only)'
> > > > > > > > > >      oldInterval = 18000
> > > > > > > > > >      theCheckpointInterval = 18000
> > > > > > > > > > [10273] NOTE at dmtcp_coordinator.cpp:1081 in onConnect;
> > > > > REASON='worker
> > > > > > > > > > connected'
> > > > > > > > > >      hello_remote.from = 704891f6494a6304-10275-597be341
> > > > > > > > > > [10273] NOTE at dmtcp_coordinator.cpp:869 in onData;
> > > > > REASON='Updating
> > > > > > > > > > process Information after exec()'
> > > > > > > > > >      progname = strace
> > > > > > > > > >      msg.from = 704891f6494a6304-40000-597be341
> > > > > > > > > >      client->identity() = 704891f6494a6304-10275-597be341
> > > > > > > > > > [10273] NOTE at dmtcp_coordinator.cpp:1081 in onConnect;
> > > > > REASON='worker
> > > > > > > > > > connected'
> > > > > > > > > >      hello_remote.from = 704891f6494a6304-40000-597be341
> > > > > > > > > > [10273] NOTE at dmtcp_coordinator.cpp:860 in onData;
> > > > > REASON='Updating
> > > > > > > > > > process Information after fork()'
> > > > > > > > > >      client->hostname() = gpint200
> > > > > > > > > >      client->progname() = strace_(forked)
> > > > > > > > > >      msg.from = 704891f6494a6304-41000-597be341
> > > > > > > > > >      client->identity() = 704891f6494a6304-40000-597be341
> > > > > > > > > > [10273] NOTE at dmtcp_coordinator.cpp:869 in onData;
> > > > > REASON='Updating
> > > > > > > > > > process Information after exec()'
> > > > > > > > > > [10273] NOTE at dmtcp_coordinator.cpp:1081 in onConnect;
> > > > > REASON='worker
> > > > > > > > > > connected'
> > > > > > > > > >      hello_remote.from = 704891f6494a6304-40000-597be341
> > > > > > > > > > [10273] NOTE at dmtcp_coordinator.cpp:860 in onData;
> > > > > REASON='Updating
> > > > > > > > > > process Information after fork()'
> > > > > > > > > >      client->hostname() = gpint200
> > > > > > > > > >      client->progname() = strace_(forked)
> > > > > > > > > >      msg.from = 704891f6494a6304-41000-597be341
> > > > > > > > > >      client->identity() = 704891f6494a6304-40000-597be341
> > > > > > > > > > [10273] NOTE at dmtcp_coordinator.cpp:869 in onData;
> > > > > REASON='Updating
> > > > > > > > > > process Information after exec()'
> > > > > > > > > >      progname = python3.4
> > > > > > > > > >      msg.from = 704891f6494a6304-41000-597be341
> > > > > > > > > >      client->identity() = 704891f6494a6304-41000-597be341
> > > > > > > > > > [10273] NOTE at dmtcp_coordinator.cpp:1081 in onConnect;
> > > > > REASON='worker
> > > > > > > > > > connected'
> > > > > > > > > >      hello_remote.from = 704891f6494a6304-41000-597be341
> > > > > > > > > > k
> > > > > > > > > > [10273] NOTE at dmtcp_coordinator.cpp:603 in
> > handleUserCommand;
> > > > > > > > > > REASON='Killing all connected Peers...'
> > > > > > > > > > [10273] NOTE at dmtcp_coordinator.cpp:919 in onDisconnect;
> > > > > > > > REASON='client
> > > > > > > > > > disconnected'
> > > > > > > > > >      client->identity() = 704891f6494a6304-40000-597be341
> > > > > > > > > >      client->progname() = strace
> > > > > > > > > > [10273] NOTE at dmtcp_coordinator.cpp:919 in onDisconnect;
> > > > > > > > REASON='client
> > > > > > > > > > disconnected'
> > > > > > > > > >      client->identity() = 704891f6494a6304-41000-597be341
> > > > > > > > > >      client->progname() = python3.4_(forked)
> > > > > > > > > > [10273] NOTE at dmtcp_coordinator.cpp:919 in onDisconnect;
> > > > > > > > REASON='client
> > > > > > > > > > disconnected'
> > > > > > > > > >      client->identity() = 704891f6494a6304-41000-597be341
> > > > > > > > > >      client->progname() = python3.4
> > > > > > > > > > l
> > > > > > > > > > Client List:
> > > > > > > > > > #, PROG[virtPID:realPID]@HOST, DMTCP-UNIQUEPID, STATE
> > > > > > > > > > q
> > > > > > > > > > [10273] NOTE at dmtcp_coordinator.cpp:590 in
> > handleUserCommand;
> > > > > > > > > > REASON='killing all connected peers and quitting ...'
> > > > > > > > > > DMTCP coordinator exiting... (per request)
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > This is a process that on large data sets can take a few days
> > > > > > > > > > on a 32-core node, so it is something that we would really like
> > > > > > > > > > to be able to checkpoint on our cluster.
> > > > > > > > > >
> > > > > > > > > > Please let me know how I can help debug this with you.  I've
> > > > > > > > > > tried versions 1.2.5, 2.4.8, and 2.5, and they all have the
> > > > > > > > > > same problem.
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Rob Egan
> > > > > > > > >
diff --git a/src/execwrappers.cpp b/src/execwrappers.cpp
index ea9e66b..1097b07 100644
--- a/src/execwrappers.cpp
+++ b/src/execwrappers.cpp
@@ -174,7 +174,7 @@ extern "C" pid_t fork()
   /* Acquire the wrapperExeution lock to prevent checkpoint to happen while
    * processing this system call.
    */
-  WRAPPER_EXECUTION_GET_EXCL_LOCK();
+ // WRAPPER_EXECUTION_GET_EXCL_LOCK();
   DmtcpWorker::eventHook(DMTCP_EVENT_ATFORK_PREPARE, NULL);
 
   /* Little bit cheating here: child_time should be same for both parent and
@@ -226,7 +226,7 @@ extern "C" pid_t fork()
   if (childPid != 0) {
     coordinatorAPI.closeConnection();
     DmtcpWorker::eventHook(DMTCP_EVENT_ATFORK_PARENT, NULL);
-    WRAPPER_EXECUTION_RELEASE_EXCL_LOCK();
+    // WRAPPER_EXECUTION_RELEASE_EXCL_LOCK();
   }
   return childPid;
 }