Hi Rob,

The risk with the patch is that if a checkpoint is taken in the
middle of a fork(), there may be some inconsistency. But I'm not
sure that is the case you are describing. As I understand it, the
original hang was observed at program launch time, and once the
program passes the initialization phase (thanks to the patch),
there shouldn't be many calls to fork() anymore. Please correct me
if I'm wrong.
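
To illustrate, here is a rough sketch of the fork() wrapper logic
(simplified pseudo-code, not the actual DMTCP source; the names are
only illustrative, and _real_fork() stands for the unwrapped libc
fork()):

    #include <sys/types.h>

    extern pid_t _real_fork(void);    /* the unwrapped libc fork() */

    pid_t fork(void)
    {
        /* Before the patch:  wrapperExecutionLockLock();
         * i.e. keep the checkpoint thread out while fork() is in flight. */
        pid_t pid = _real_fork();     /* also runs the pthread_atfork() handlers */
        if (pid == 0) {
            /* child: reconnect to the coordinator, refresh its ids, ... */
        }
        /* Before the patch:  wrapperExecutionLockUnlock(); */
        return pid;
    }

With the lock removed, a checkpoint request can land between
_real_fork() and the point where the child has finished setting
itself up again, so parent and child could be captured in
inconsistent states.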

Also, I don't completely understand the issue. What acknowledgment
do you mean? Should that be handled by the application itself?
Could you post the exact error message?

Best,
Jiajun

On Wed, Aug 09, 2017 at 04:58:21PM -0700, Rob Egan wrote:
> Hi Jiajun,
> This is just an FYI.  Unsurprisingly, this patch seems to be interfering
> with a proper checkpoint / restart, and I think it is because there is now
> a race between the states of the threads at the time of the checkpoint, so
> the state of the whole program across the threads is inconsistent.
> Basically, the way I understand the script is that it creates a pool of
> workers to process a list of files and eventually move them to a new
> directory.  After restart, at least one thread can't find the file anymore,
> because another thread had already moved it to its new place, but
> presumably it never got the acknowledgment from the worker thread.
> -Rob
> 
> On Mon, Aug 7, 2017 at 4:59 PM, Rob Egan <rse...@lbl.gov> wrote:
> 
> > I can confirm that there is no hang after applying this patch.
> > Thanks,
> > Rob
> >
> > On Mon, Aug 7, 2017 at 12:04 PM, Jiajun Cao <jia...@ccs.neu.edu> wrote:
> >
> >> Hi Rob,
> >>
> >> Could you please try the attached patch? Basically it removes
> >> the lock/unlock operations in the fork() wrapper.
> >>
> >> Note this is a temporary solution that helps us better understand
> >> the bug. If you no longer see the hang, we'll think about
> >> a more elegant fix.
> >>
> >> Best,
> >> Jiajun
> >>
> >> On Fri, Aug 04, 2017 at 03:56:05PM -0700, Rob Egan wrote:
> >> > Hi Rohan,
> >> > Great!  Thank you.  I got your email at 11:23 AM and this one at 3:15PM.
> >> > I'll watch that bug and test the patch when it is available too.
> >> > Best,
> >> > Rob
> >> >
> >> > On Fri, Aug 4, 2017 at 3:15 PM, Rohan Garg <rohg...@ccs.neu.edu> wrote:
> >> >
> >> > > Hi Rob,
> >> > >
> >> > > For some reason, the DMTCP forum on SourceForge is acting weird;
> >> > > it rejected my last message. So I'm not sure if you received my
> >> > > last e-mail.
> >> > >
> >> > > We are looking into this actively. Jiajun has also opened an issue
> >> > > on Github (see: https://github.com/dmtcp/dmtcp/issues/612), which
> >> > > we believe is the reason for the bug that you are seeing. Hopefully,
> >> > > we should have a fix soon.
> >> > >
> >> > > -Rohan
> >> > >
> >> > > On Fri, Aug 04, 2017 at 02:23:43PM -0400, Rohan Garg wrote:
> >> > > > Hi Rob,
> >> > > >
> >> > > >
> >> > > > I was discussing this with Jiajun, and it seems like we might be
> >> > > > able to write a simple test program to simulate this bug. This will
> >> > > > definitely require a fix in DMTCP. Hopefully, we can provide a patch
> >> > > > to you soon.
> >> > > >
> >> > > > I'll try to work on it this weekend. We'll see if a screen sharing
> >> > > > session is necessary.
> >> > > >
> >> > > > Thanks,
> >> > > > Rohan
> >> > > >
> >> > > > On Tue, Aug 01, 2017 at 11:00:50PM -0700, Rob Egan wrote:
> >> > > > > Thanks Jiajun & Rohan!
> >> > > > >
> >> > > > > So in the trace I sent before, threads 2 through 41 are all identical,
> >> > > > > so almost all the threads are in that nanosleep.  Since this is third
> >> > > > > party software, my ability to change the environment and code is quite
> >> > > > > limited, and Python 3.4 is the only version that I was able to get
> >> > > > > working in this environment.
> >> > > > >
> >> > > > > Jiajun's explanation sounds like the issue to me.  What would be the
> >> > > > > next steps to address it?  Would we need to patch dmtcp? openblas?
> >> > > > >
> >> > > > > I'd be happy to arrange a screen sharing / zoom / hangouts session
> >> > > > > with one or more of you.
> >> > > > >
> >> > > > > -Rob
> >> > > > >
> >> > > > > On Tue, Aug 1, 2017 at 9:37 AM, Jiajun Cao <jia...@ccs.neu.edu> wrote:
> >> > > > >
> >> > > > > > Hi Rob,
> >> > > > > >
> >> > > > > > I know what's going on here after browsing some blas
> >> > > > > > source code. As Rohan mentioned, thread 42 is behaving normally.
> >> > > > > > What looks more interesting is the calling stack of thread 1, the
> >> > > > > > main thread, especially the bottom 5 frames. It seems that the blas
> >> > > > > > library registers some handler at fork, using pthread_atfork():
> >> > > > > >
> >> > > > > > https://github.com/xianyi/OpenBLAS/blob/e70a6b92bf9f69f113e198e3ded558ed0675b038/driver/others/memory.c
> >> > > > > >
> >> > > > > > The handler tries to join a bunch of other blas threads, in this
> >> > > > > > case, including thread 41.
> >> > > > > >
> >> > > > > > The problem is, DMTCP has its own internal synchronization
> >> > > > > > semantics. When a thread exits, it tries to grab a lock to prevent a
> >> > > > > > checkpoint from happening: it is an internal critical section. On the
> >> > > > > > other hand, when the application calls fork(), it also grabs the lock
> >> > > > > > before fork() returns: another internal critical section. That's why
> >> > > > > > we see a deadlock here:
> >> > > > > >
> >> > > > > > thread 1: lock acquired, and waiting for thread 41 to finish
> >> > > > > > thread 41: trying to acquire the lock before it exits
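> >> > > > > >
> >> > > > > > The same cycle can be reproduced outside DMTCP with a small
> >> > > > > > standalone program (just a sketch; the plain mutex stands in for our
> >> > > > > > internal wrapper-execution lock, and the sleep() only makes the
> >> > > > > > ordering deterministic):
> >> > > > > >
> >> > > > > >     #include <pthread.h>
> >> > > > > >     #include <unistd.h>
> >> > > > > >
> >> > > > > >     /* Stand-in for DMTCP's wrapper-execution lock. */
> >> > > > > >     static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
> >> > > > > >     static pthread_t worker;
> >> > > > > >
> >> > > > > >     static void *worker_fn(void *arg)
> >> > > > > >     {
> >> > > > > >         sleep(1);                      /* let the main thread go first */
> >> > > > > >         pthread_mutex_lock(&lock);     /* "thread exit" path: needs the lock */
> >> > > > > >         pthread_mutex_unlock(&lock);
> >> > > > > >         return NULL;                   /* only now could a join succeed */
> >> > > > > >     }
> >> > > > > >
> >> > > > > >     static void prepare(void)
> >> > > > > >     {
> >> > > > > >         pthread_join(worker, NULL);    /* OpenBLAS-style atfork handler */
> >> > > > > >     }
> >> > > > > >
> >> > > > > >     int main(void)
> >> > > > > >     {
> >> > > > > >         pthread_atfork(prepare, NULL, NULL);
> >> > > > > >         pthread_create(&worker, NULL, worker_fn, NULL);
> >> > > > > >         pthread_mutex_lock(&lock);     /* "fork() wrapper": take the lock */
> >> > > > > >         fork();                        /* prepare() joins the worker: deadlock */
> >> > > > > >         pthread_mutex_unlock(&lock);
> >> > > > > >         return 0;
> >> > > > > >     }
> >> > > > > >
> >> > > > > > Built with gcc -pthread, the main thread hangs in pthread_join()
> >> > > > > > inside the prepare handler, just like thread 1 in the backtrace Rob
> >> > > > > > posted.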
> >> > > > > >
> >> > > > > > Best,
> >> > > > > > Jiajun
> >> > > > > >
> >> > > > > > On Mon, Jul 31, 2017 at 10:15:15PM -0700, Rohan Garg wrote:
> >> > > > > > > Hi Rob,
> >> > > > > > >
> >> > > > > > > Thread 42 is DMTCP's checkpoint thread. DMTCP injects that in every
> >> > > > > > > user process that's launched using DMTCP. It seems to be simply
> >> > > > > > > waiting for a message from the coordinator, which is the normal
> >> > > > > > > state of things. It's not stuck or the cause of the deadlock here.
> >> > > > > > >
> >> > > > > > > The deadlock is definitely because of one or more user threads. It
> >> > > > > > > seems like the program also forks and execs some child processes.
> >> > > > > > > We have seen some corner cases around that, but it's difficult to
> >> > > > > > > say what it is with the information below.
> >> > > > > > >
> >> > > > > > > The fact that even `--disable-all-plugins` didn't work is quite
> >> > > > > > > interesting.
> >> > > > > > >
> >> > > > > > > I have a few other suggestions:
> >> > > > > > >
> >> > > > > > >  - Is it possible to share perhaps the top 20-25 frames from each of
> >> > > > > > >    the threads? (You can also send them to us/me directly at:
> >> > > > > > >    dm...@ccs.neu.edu/rohg...@ccs.neu.edu)
> >> > > > > > >
> >> > > > > > >    Also, after you save the backtraces from all the threads, detach
> >> > > > > > >    from gdb, wait for a few seconds, re-attach, and save a second
> >> > > > > > >    set of backtraces. It'll help us figure out which threads are
> >> > > > > > >    really stuck.
> >> > > > > > >
> >> > > > > > >  - Is it possible to run with other Python versions (not 3.4) as a
> >> > > > > > >    quick test?
> >> > > > > > >
> >> > > > > > >  - What Numpy/OpenBLAS versions are you using? I'll try to run
> >> > > > > > >    a few experiments locally.
> >> > > > > > >
> >> > > > > > >  - Is it possible to try to cut down your Python script to the bare
> >> > > > > > >    minimum, to isolate the issue? Perhaps just with a few imports
> >> > > > > > >    and then add code incrementally. But if it turns out to be more
> >> > > > > > >    work than it's worth, then we should skip this suggestion.
> >> > > > > > >
> >> > > > > > >  - Finally, if it's possible, we could arrange for a screen sharing
> >> > > > > > >    session. I think it'll be more efficient for interactively
> >> > > > > > >    debugging this.
> >> > > > > > >
> >> > > > > > > -Rohan
> >> > > > > > >
> >> > > > > > > On Mon, Jul 31, 2017 at 03:30:55PM -0700, Rob Egan wrote:
> >> > > > > > > > Hi Rohan,
> >> > > > > > > >
> >> > > > > > > > Thanks for the quick reply.  So using version 2.4.8 and the
> >> > > > > > > > --disable-alloc-plugin and --disable-dl-plugin options, the behavior
> >> > > > > > > > is the same.  I also tried --disable-all-plugins and it still did not
> >> > > > > > > > work, so I think this may be different from the numpy bug you
> >> > > > > > > > identified, though as you will see the stack trace does show one of
> >> > > > > > > > the threads in the numpy / openblas library.
> >> > > > > > > >
> >> > > > > > > > Though we are running at NERSC, I am not running this on the cray
> >> > > > > > > > systems (yet) and not through slurm (yet).
> >> > > > > > > >
> >> > > > > > > > When I attach to the running process, unfortunately there are a ton
> >> > > > > > > > of threads and a deep python stack.  Apparently even though I only
> >> > > > > > > > requested 1 thread, the python library allocates a ton of them.  I
> >> > > > > > > > tried to remove duplicates manually for you.  It looks like threads 42
> >> > > > > > > > and 1 are the interesting ones; 2-41 are all the same.  It looks to me
> >> > > > > > > > like dmtcp_coordinator is interfering with the blas library's
> >> > > > > > > > pthread_join calls.
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > Thread 42 (Thread 0x7f266738e700 (LWP 14136)):
> >> > > > > > > > #0  0x00007f26690ed14d in read () from /lib/libpthread.so.0
> >> > > > > > > > #1  0x00007f2669804224 in _real_read (fd=<optimized out>,
> >> > > > > > buf=<optimized
> >> > > > > > > > out>, count=<optimized out>) at syscallsreal.c:504
> >> > > > > > > > #2  0x00007f26697eae80 in dmtcp::Util::readAll (fd=821,
> >> > > > > > buf=0x7f266738bc00,
> >> > > > > > > > count=136) at util_misc.cpp:279
> >> > > > > > > > #3  0x00007f26697c303b in operator>><dmtcp::DmtcpMessage>
> >> > > (t=...,
> >> > > > > > > > this=0x7f2669c33408) at ../jalib/jsocket.h:109
> >> > > > > > > > #4  dmtcp::CoordinatorAPI::recvMsgFromCoordinator
> >> > > > > > (this=0x7f2669c33408,
> >> > > > > > > > msg=0x7f266738bc00, extraData=0x0) at coordinatorapi.cpp:422
> >> > > > > > > > #5  0x00007f26697bb9fc in dmtcp::DmtcpWorker::waitForCoordinatorMsg
> >> > > > > > > > (msgStr=..., type=dmtcp::DMT_DO_SUSPEND) at dmtcpworker.cpp:469
> >> > > > > > > > #6  0x00007f26697bc58b in dmtcp::DmtcpWorker::waitForStage1Suspend ()
> >> > > > > > > > at dmtcpworker.cpp:513
> >> > > > > > > > #7  0x00007f26697ccd1a in dmtcp::callbackSleepBetweenCheckpoint
> >> > > > > > > > (sec=<optimized out>) at mtcpinterface.cpp:61
> >> > > > > > > > #8  0x00007f26697d85e9 in checkpointhread (dummy=<optimized
> >> > > out>) at
> >> > > > > > > > threadlist.cpp:359
> >> > > > > > > > #9  0x00007f26697cfb9e in pthread_start (arg=<optimized
> >> out>) at
> >> > > > > > > > threadwrappers.cpp:159
> >> > > > > > > > #10 0x00007f26690e58ca in start_thread () from
> >> > > /lib/libpthread.so.0
> >> > > > > > > > #11 0x00007f26697cf9c9 in clone_start (arg=0x7f2669c15008)
> >> at
> >> > > > > > > > threadwrappers.cpp:68
> >> > > > > > > > #12 0x00007f26687c3abd in clone () from /lib/libc.so.6
> >> > > > > > > > #13 0x0000000000000000 in ?? ()
> >> > > > > > > >
> >> > > > > > > > Thread 41 (Thread 0x7f2660f99700 (LWP 14137)):
> >> > > > > > > > #0  0x00007f26690ed86d in nanosleep () from
> >> /lib/libpthread.so.0
> >> > > > > > > > #1  0x00007f26697bf599 in dmtcp::ThreadSync::wrapperExecutionLockLock
> >> > > > > > > > () at threadsync.cpp:392
> >> > > > > > > > #2  0x00007f26697cfba6 in pthread_start (arg=<optimized
> >> out>) at
> >> > > > > > > > threadwrappers.cpp:161
> >> > > > > > > > #3  0x00007f26690e58ca in start_thread () from
> >> > > /lib/libpthread.so.0
> >> > > > > > > > #4  0x00007f26697cf9c9 in clone_start (arg=0x7f2669c24008)
> >> at
> >> > > > > > > > threadwrappers.cpp:68
> >> > > > > > > > #5  0x00007f26687c3abd in clone () from /lib/libc.so.6
> >> > > > > > > > #6  0x0000000000000000 in ?? ()
> >> > > > > > > >
> >> > > > > > > > Thread 40 (Thread 0x7f2660798700 (LWP 14138)):
> >> > > > > > > > ...
> >> > > > > > > >
> >> > > > > > > > Thread 1 (Thread 0x7f2669c26740 (LWP 14131)):
> >> > > > > > > > #0  0x00007f26690ece69 in __lll_timedwait_tid () from /lib/libpthread.so.0
> >> > > > > > > > #1  0x00007f26690e6ebf in pthread_timedjoin_np () from /lib/libpthread.so.0
> >> > > > > > > > #2  0x00007f26697d0048 in pthread_join (thread=139802812454656,
> >> > > > > > > > retval=0x0) at threadwrappers.cpp:274
> >> > > > > > > > #3  0x00007f26615cd2d7 in blas_thread_shutdown_ () from
> >> > > > > > > > /global/homes/r/regan/.local/lib/python3.4/site-packages/numpy/core/../.libs/libopenblasp-r0-39a31c03.2.18.so
> >> > > > > > > > #4  0x00007f26687936d5 in fork () from /lib/libc.so.6
> >> > > > > > > > #5  0x00007f26697c9724 in fork () at execwrappers.cpp:186
> >> > > > > > > > #6  0x00007f2665b9ed7f in subprocess_fork_exec
> >> (self=<optimized
> >> > > out>,
> >> > > > > > > > args=<optimized out>) at
> >> > > > > > > > /global/common/genepool/usg/languages/python/build/Python-
> >> > > > > > 3.4.3/Modules/_posixsubprocess.c:649
> >> > > > > > > > #7  0x00007f2669427225 in call_function (oparg=<optimized
> >> out>,
> >> > > > > > > > pp_stack=0x7fff86d8d690) at Python/c
> >> > > > > > > > eval.c:4237
> >> > > > > > > > #8  PyEval_EvalFrameEx (f=<optimized out>,
> >> throwflag=<optimized
> >> > > out>)
> >> > > > > > at
> >> > > > > > > > Python/ceval.c:2838
> >> > > > > > > > #9  0x00007f2669428cec in PyEval_EvalCodeEx (_co=<optimized
> >> out>,
> >> > > > > > > > globals=<optimized out>, locals=<optimized out>,
> >> args=<optimized
> >> > > out>,
> >> > > > > > > > argcount=19, kws=0x7f25f4014ad8, kwcount=0, defs=0x0,
> >> > > > > > > >     defcount=0, kwdefs=0x0, closure=0x0) at
> >> Python/ceval.c:3588
> >> > > > > > > > #10 0x00007f2669427ca4 in fast_function (nk=<optimized out>,
> >> > > > > > na=<optimized
> >> > > > > > > > out>, n=<optimized out>, pp_stack=0x7fff86d8d880,
> >> > > func=0x7f2669a60ea0)
> >> > > > > > at
> >> > > > > > > > Python/ceval.c:4344
> >> > > > > > > > #11 call_function (oparg=<optimized out>,
> >> > > pp_stack=0x7fff86d8d880) at
> >> > > > > > > > Python/ceval.c:4262
> >> > > > > > > > #12 PyEval_EvalFrameEx (f=<optimized out>,
> >> throwflag=<optimized
> >> > > out>)
> >> > > > > > at
> >> > > > > > > > Python/ceval.c:2838
> >> > > > > > > > #13 0x00007f2669428cec in PyEval_EvalCodeEx (_co=<optimized
> >> out>,
> >> > > > > > > > globals=<optimized out>, locals=<optimized out>,
> >> args=<optimized
> >> > > out>,
> >> > > > > > > > argcount=2, kws=0x7f25fa0dda70, kwcount=4,
> >> > > > > > > >     defs=0x7f2669a55110, defcount=16, kwdefs=0x0,
> >> closure=0x0) at
> >> > > > > > > > Python/ceval.c:3588
> >> > > > > > > > #14 0x00007f266938a2bc in function_call
> >> (func=0x7f2669a60840,
> >> > > > > > > > arg=0x7f25f9f2c148, kw=0x7f25f9f46208) at
> >> > > Objects/funcobject.c:632
> >> > > > > > > > #15 0x00007f266935e8be in PyObject_Call
> >> (func=0x7f2669a60840,
> >> > > > > > > > arg=<optimized out>, kw=<optimized out>) at
> >> > > Objects/abstract.c:2040
> >> > > > > > > > #16 0x00007f2669375354 in method_call (func=0x7f2669a60840,
> >> > > > > > > > arg=0x7f25f9f2c148, kw=0x7f25f9f46208) at
> >> > > Objects/classobject.c:347
> >> > > > > > > > #17 0x00007f266935e8be in PyObject_Call
> >> (func=0x7f2663ff4148,
> >> > > > > > > > arg=<optimized out>, kw=<optimized out>) at
> >> > > Objects/abstract.c:2040
> >> > > > > > > > #18 0x00007f26693c15f7 in slot_tp_init (self=0x7f25f9f45198,
> >> > > > > > > > args=0x7f25f9f45128, kwds=0x7f25f9f46208) at
> >> > > Objects/typeobject.c:6177
> >> > > > > > > > #19 0x00007f26693bd17f in type_call (type=<optimized out>,
> >> > > > > > > > args=0x7f25f9f45128, kwds=0x7f25f9f46208) at
> >> > > Objects/typeobject.c:898
> >> > > > > > > > #20 0x00007f266935e8be in PyObject_Call (func=0x1b8c0a8,
> >> > > arg=<optimized
> >> > > > > > > > out>, kw=<optimized out>) at Objects/abstract.c:2040
> >> > > > > > > > #21 0x00007f2669425a06 in do_call (nk=<optimized out>,
> >> > > na=<optimized
> >> > > > > > out>,
> >> > > > > > > > pp_stack=0x7fff86d8dc80, func=0x1b8c0a8) at
> >> Python/ceval.c:4466
> >> > > > > > > > #22 call_function (oparg=<optimized out>,
> >> > > pp_stack=0x7fff86d8dc80) at
> >> > > > > > > > Python/ceval.c:4264
> >> > > > > > > > #23 PyEval_EvalFrameEx (f=<optimized out>,
> >> throwflag=<optimized
> >> > > out>)
> >> > > > > > at
> >> > > > > > > > Python/ceval.c:2838
> >> > > > > > > > #24 0x00007f2669427d68 in fast_function (nk=<optimized out>,
> >> > > > > > na=<optimized
> >> > > > > > > > out>, n=1, pp_stack=0x7fff86d8ddc0, func=0x7f25f9e96bf8) at
> >> > > > > > > > Python/ceval.c:4334
> >> > > > > > > > #25 call_function (oparg=<optimized out>,
> >> > > pp_stack=0x7fff86d8ddc0) at
> >> > > > > > > > Python/ceval.c:4262
> >> > > > > > > > #26 PyEval_EvalFrameEx (f=<optimized out>,
> >> throwflag=<optimized
> >> > > out>)
> >> > > > > > at
> >> > > > > > > > Python/ceval.c:2838
> >> > > > > > > > #27 0x00007f2669427d68 in fast_function (nk=<optimized out>,
> >> > > > > > na=<optimized
> >> > > > > > > > out>, n=1, pp_stack=0x7fff86d8df00, func=0x7f25f9e96c80) at
> >> > > > > > > > Python/ceval.c:4334
> >> > > > > > > > #28 call_function (oparg=<optimized out>,
> >> > > pp_stack=0x7fff86d8df00) at
> >> > > > > > > > Python/ceval.c:4262
> >> > > > > > > > #29 PyEval_EvalFrameEx (f=<optimized out>,
> >> throwflag=<optimized
> >> > > out>)
> >> > > > > > at
> >> > > > > > > > Python/ceval.c:2838
> >> > > > > > > > #30 0x00007f2669428cec in PyEval_EvalCodeEx (_co=<optimized
> >> out>,
> >> > > > > > > > globals=<optimized out>, locals=<optimized out>,
> >> args=<optimized
> >> > > out>,
> >> > > > > > > > argcount=0, kws=0x0, kwcount=0, defs=0x0, defcount=0,
> >> > > > > > > >     kwdefs=0x0, closure=0x0) at Python/ceval.c:3588
> >> > > > > > > > #31 0x00007f2669428dbb in PyEval_EvalCode (co=<optimized
> >> out>,
> >> > > > > > > > globals=<optimized out>, locals=<optimized out>) at
> >> > > Python/ceval.c:775
> >> > > > > > > > #32 0x00007f266941d3d8 in builtin_exec (self=<optimized
> >> out>,
> >> > > > > > > > args=<optimized out>) at Python/bltinmodule.c:882
> >> > > > > > > > #33 0x00007f2669426247 in ext_do_call (nk=0, na=0,
> >> > > flags=<optimized
> >> > > > > > out>,
> >> > > > > > > > pp_stack=0x7fff86d8e1f0, func=0x7f2669c05788) at
> >> > > Python/ceval.c:4558
> >> > > > > > > > #34 PyEval_EvalFrameEx (f=<optimized out>,
> >> throwflag=<optimized
> >> > > out>)
> >> > > > > > at
> >> > > > > > > > Python/ceval.c:2878
> >> > > > > > > > #35 0x00007f2669428cec in PyEval_EvalCodeEx (_co=<optimized
> >> out>,
> >> > > > > > > > globals=<optimized out>, locals=<optimized out>,
> >> args=<optimized
> >> > > out>,
> >> > > > > > > > argcount=3, kws=0x7f2663d13938, kwcount=0, defs=0x0,
> >> > > > > > > >     defcount=0, kwdefs=0x0, closure=0x0) at
> >> Python/ceval.c:3588
> >> > > > > > > > #36 0x00007f2669427ca4 in fast_function (nk=<optimized out>,
> >> > > > > > na=<optimized
> >> > > > > > > > out>, n=<optimized out>, pp_stack=0x7fff86d8e3e0,
> >> > > func=0x7f2669bd3b70)
> >> > > > > > at
> >> > > > > > > > Python/ceval.c:4344
> >> > > > > > > > #37 call_function (oparg=<optimized out>,
> >> > > pp_stack=0x7fff86d8e3e0) at
> >> > > > > > > > Python/ceval.c:4262
> >> > > > > > > > #38 PyEval_EvalFrameEx (f=<optimized out>,
> >> throwflag=<optimized
> >> > > out>)
> >> > > > > > at
> >> > > > > > > > Python/ceval.c:2838
> >> > > > > > > > #39 0x00007f2669427d68 in fast_function (nk=<optimized out>,
> >> > > > > > na=<optimized
> >> > > > > > > > out>, n=2, pp_stack=0x7fff86d8e520, func=0x7f2669b5d158) at
> >> > > > > > > > Python/ceval.c:4334
> >> > > > > > > > #40 call_function (oparg=<optimized out>,
> >> > > pp_stack=0x7fff86d8e520) at
> >> > > > > > > > Python/ceval.c:4262
> >> > > > > > > > #41 PyEval_EvalFrameEx (f=<optimized out>,
> >> throwflag=<optimized
> >> > > out>)
> >> > > > > > at
> >> > > > > > > > Python/ceval.c:2838
> >> > > > > > > > #42 0x00007f2669427d68 in fast_function (nk=<optimized out>,
> >> > > > > > na=<optimized
> >> > > > > > > > out>, n=2, pp_stack=0x7fff86d8e660, func=0x7f2669b56ea0) at
> >> > > > > > > > Python/ceval.c:4334
> >> > > > > > > > #43 call_function (oparg=<optimized out>,
> >> > > pp_stack=0x7fff86d8e660) at
> >> > > > > > > > Python/ceval.c:4262
> >> > > > > > > > #44 PyEval_EvalFrameEx (f=<optimized out>,
> >> throwflag=<optimized
> >> > > out>)
> >> > > > > > at
> >> > > > > > > > Python/ceval.c:2838
> >> > > > > > > > #45 0x00007f2669427d68 in fast_function (nk=<optimized out>,
> >> > > > > > na=<optimized
> >> > > > > > > > out>, n=1, pp_stack=0x7fff86d8e7a0, func=0x7f2669b590d0) at
> >> > > > > > > > Python/ceval.c:4334
> >> > > > > > > > #46 call_function (oparg=<optimized out>,
> >> > > pp_stack=0x7fff86d8e7a0) at
> >> > > > > > > > Python/ceval.c:4262
> >> > > > > > > > #47 PyEval_EvalFrameEx (f=<optimized out>,
> >> throwflag=<optimized
> >> > > out>)
> >> > > > > > at
> >> > > > > > > > Python/ceval.c:2838
> >> > > > > > > > #48 0x00007f2669427d68 in fast_function (nk=<optimized out>,
> >> > > > > > na=<optimized
> >> > > > > > > > out>, n=2, pp_stack=0x7fff86d8e8e0, func=0x7f2669b61620) at
> >> > > > > > > > Python/ceval.c:4334
> >> > > > > > > > #49 call_function (oparg=<optimized out>,
> >> > > pp_stack=0x7fff86d8e8e0) at
> >> > > > > > > > Python/ceval.c:4262
> >> > > > > > > > #50 PyEval_EvalFrameEx (f=<optimized out>,
> >> throwflag=<optimized
> >> > > out>)
> >> > > > > > at
> >> > > > > > > > Python/ceval.c:2838
> >> > > > > > > > #51 0x00007f2669428cec in PyEval_EvalCodeEx (_co=<optimized
> >> out>,
> >> > > > > > > > globals=<optimized out>, locals=<optimized out>,
> >> args=<optimized
> >> > > out>,
> >> > > > > > > > argcount=2, kws=0x0, kwcount=0, defs=0x0, defcount=0,
> >> > > > > > > >     kwdefs=0x0, closure=0x0) at Python/ceval.c:3588
> >> > > > > > > > #52 0x00007f266938a1b5 in function_call
> >> (func=0x7f2669b616a8,
> >> > > > > > > > arg=0x7f25f9f2cf48, kw=0x0) at Objects/funcobject.c:632
> >> > > > > > > > #53 0x00007f266935e8be in PyObject_Call
> >> (func=0x7f2669b616a8,
> >> > > > > > > > arg=<optimized out>, kw=<optimized out>) at
> >> > > Objects/abstract.c:2040
> >> > > > > > > > #54 0x00007f266935f2cd in _PyObject_CallMethodIdObjArgs
> >> > > > > > > > (callable=0x7f2669b616a8, name=<optimized out>) at
> >> > > > > > Objects/abstract.c:2310
> >> > > > > > > > #55 0x00007f26694427bc in PyImport_ImportModuleLevelObject
> >> > > > > > > > (name=0x7f25fb28aa78, given_globals=<optimized out>,
> >> > > locals=<optimized
> >> > > > > > > > out>, given_fromlist=<optimized out>, level=0) at
> >> > > Python/import.c:1636
> >> > > > > > > > #56 0x00007f266941dcaf in builtin___import__
> >> (self=<optimized
> >> > > out>,
> >> > > > > > > > args=<optimized out>, kwds=<optimized out>) at
> >> > > Python/bltinmodule.c:210
> >> > > > > > > > #57 0x00007f266935e8be in PyObject_Call
> >> (func=0x7f2669c053c8,
> >> > > > > > > > arg=<optimized out>, kw=<optimized out>) at
> >> > > Objects/abstract.c:2040
> >> > > > > > > > #58 0x00007f2669420227 in PyEval_CallObjectWithKeywords
> >> > > > > > > > (func=0x7f2669c053c8, arg=0x7f25fa0dfc50, kw=<optimized
> >> out>) at
> >> > > > > > > > Python/ceval.c:4114
> >> > > > > > > > #59 0x00007f2669421907 in PyEval_EvalFrameEx (f=<optimized
> >> out>,
> >> > > > > > > > throwflag=<optimized out>) at Python/ceval.c:2478
> >> > > > > > > > #60 0x00007f2669428cec in PyEval_EvalCodeEx (_co=<optimized
> >> out>,
> >> > > > > > > > globals=<optimized out>, locals=<optimized out>,
> >> args=<optimized
> >> > > out>,
> >> > > > > > > > argcount=2, kws=0x7f25f9f3e3d0, kwcount=2,
> >> > > > > > > >     defs=0x7f25fa5ed360, defcount=2, kwdefs=0x0,
> >> closure=0x0) at
> >> > > > > > > > Python/ceval.c:3588
> >> > > > > > > > #61 0x00007f2669427ca4 in fast_function (nk=<optimized out>,
> >> > > > > > na=<optimized
> >> > > > > > > > out>, n=<optimized out>, pp_stack=0x7fff86d8eff0,
> >> > > func=0x7f25fa5efe18)
> >> > > > > > at
> >> > > > > > > > Python/ceval.c:4344
> >> > > > > > > > #62 call_function (oparg=<optimized out>,
> >> > > pp_stack=0x7fff86d8eff0) at
> >> > > > > > > > Python/ceval.c:4262
> >> > > > > > > > #63 PyEval_EvalFrameEx (f=<optimized out>,
> >> throwflag=<optimized
> >> > > out>)
> >> > > > > > at
> >> > > > > > > > Python/ceval.c:2838
> >> > > > > > > > #64 0x00007f2669428cec in PyEval_EvalCodeEx (_co=<optimized
> >> out>,
> >> > > > > > > > globals=<optimized out>, locals=<optimized out>,
> >> args=<optimized
> >> > > out>,
> >> > > > > > > > argcount=3, kws=0x7f2669bd5060, kwcount=0, defs=0x0,
> >> > > > > > > >     defcount=0, kwdefs=0x0, closure=0x0) at
> >> Python/ceval.c:3588
> >> > > > > > > > #65 0x00007f266938a2bc in function_call
> >> (func=0x7f25fa5efd08,
> >> > > > > > > > arg=0x7f25fa0f4ab0, kw=0x7f25f9f2cdc8) at
> >> > > Objects/funcobject.c:632
> >> > > > > > > > #66 0x00007f25fc7ae500 in __Pyx_PyObject_Call
> >> (kw=0x7f25f9f2cdc8,
> >> > > > > > > > arg=<optimized out>, func=<optimized out>) at
> >> > > > > > > > /global/scratch2/sd/gpinterp/pip-build-5fy4tlgm/h5py/h5py/_
> >> > > > > > objects.c:7420
> >> > > > > > > > #67 __pyx_pf_4h5py_8_objects_9with_phil_wrapper
> >> > > > > > > > (__pyx_v_kwds=0x7f25f9f2cdc8, __pyx_v_args=0x7f25fa0f4ab0,
> >> > > > > > > > __pyx_self=<optimized out>)
> >> > > > > > > >     at
> >> > > > > > > > /global/scratch2/sd/gpinterp/pip-build-5fy4tlgm/h5py/h5py/_
> >> > > > > > objects.c:2429
> >> > > > > > > > #68 __pyx_pw_4h5py_8_objects_9with_phil_1wrapper
> >> > > > > > (__pyx_self=<optimized
> >> > > > > > > > out>, __pyx_args=0x7f25fa0f4ab0, __pyx_kwds=<optimized out>)
> >> > > > > > > >     at
> >> > > > > > > > /global/scratch2/sd/gpinterp/pip-build-5fy4tlgm/h5py/h5py/_
> >> > > > > > objects.c:2345
> >> > > > > > > > #69 0x00007f266935e8be in PyObject_Call
> >> (func=0x7f25fa5ec498,
> >> > > > > > > > arg=<optimized out>, kw=<optimized out>) at
> >> > > Objects/abstract.c:2040
> >> > > > > > > > #70 0x00007f2669375354 in method_call (func=0x7f25fa5ec498,
> >> > > > > > > > arg=0x7f25fa0f4ab0, kw=0x0) at Objects/classobject.c:347
> >> > > > > > > > #71 0x00007f266935e8be in PyObject_Call
> >> (func=0x7f2663d3a048,
> >> > > > > > > > arg=<optimized out>, kw=<optimized out>) at
> >> > > Objects/abstract.c:2040
> >> > > > > > > > #72 0x00007f26693be3ec in call_method (o=<optimized out>,
> >> > > > > > > > nameid=0x7f266974d0d0 <PyId___setitem__>,
> >> format=0x7f26694e7609
> >> > > > > > "(OO)") at
> >> > > > > > > > Objects/typeobject.c:1429
> >> > > > > > > > #73 0x00007f26693be541 in slot_mp_ass_subscript
> >> (self=<optimized
> >> > > out>,
> >> > > > > > > > key=<optimized out>, value=<optimized out>) at
> >> > > > > > Objects/typeobject.c:5751
> >> > > > > > > > #74 0x00007f2669424d0d in PyEval_EvalFrameEx (f=<optimized
> >> out>,
> >> > > > > > > > throwflag=<optimized out>) at Python/ceval.c:1831
> >> > > > > > > > #75 0x00007f2669428cec in PyEval_EvalCodeEx (_co=<optimized
> >> out>,
> >> > > > > > > > globals=<optimized out>, locals=<optimized out>,
> >> args=<optimized
> >> > > out>,
> >> > > > > > > > argcount=3, kws=0x7f25f9f27e68, kwcount=0,
> >> > > > > > > >     defs=0x7f25fb259300, defcount=1, kwdefs=0x0,
> >> closure=0x0) at
> >> > > > > > > > Python/ceval.c:3588
> >> > > > > > > > #76 0x00007f2669427ca4 in fast_function (nk=<optimized out>,
> >> > > > > > na=<optimized
> >> > > > > > > > out>, n=<optimized out>, pp_stack=0x7fff86d8f700,
> >> > > func=0x7f25fa0db730)
> >> > > > > > at
> >> > > > > > > > Python/ceval.c:4344
> >> > > > > > > > #77 call_function (oparg=<optimized out>,
> >> > > pp_stack=0x7fff86d8f700) at
> >> > > > > > > > Python/ceval.c:4262
> >> > > > > > > > #78 PyEval_EvalFrameEx (f=<optimized out>,
> >> throwflag=<optimized
> >> > > out>)
> >> > > > > > at
> >> > > > > > > > Python/ceval.c:2838
> >> > > > > > > > #79 0x00007f2669427d68 in fast_function (nk=<optimized out>,
> >> > > > > > na=<optimized
> >> > > > > > > > out>, n=3, pp_stack=0x7fff86d8f840, func=0x7f25fa0db6a8) at
> >> > > > > > > > Python/ceval.c:4334
> >> > > > > > > > #80 call_function (oparg=<optimized out>,
> >> > > pp_stack=0x7fff86d8f840) at
> >> > > > > > > > Python/ceval.c:4262
> >> > > > > > > > #81 PyEval_EvalFrameEx (f=<optimized out>,
> >> throwflag=<optimized
> >> > > out>)
> >> > > > > > at
> >> > > > > > > > Python/ceval.c:2838
> >> > > > > > > > #82 0x00007f2669428cec in PyEval_EvalCodeEx (_co=<optimized
> >> out>,
> >> > > > > > > > globals=<optimized out>, locals=<optimized out>,
> >> args=<optimized
> >> > > out>,
> >> > > > > > > > argcount=5, kws=0x7f25f424f280, kwcount=0,
> >> > > > > > > >     defs=0x7f25fe2d1f08, defcount=1, kwdefs=0x0,
> >> closure=0x0) at
> >> > > > > > > > Python/ceval.c:3588
> >> > > > > > > > #83 0x00007f2669427ca4 in fast_function (nk=<optimized out>,
> >> > > > > > na=<optimized
> >> > > > > > > > out>, n=<optimized out>, pp_stack=0x7fff86d8fa30,
> >> > > func=0x7f25fa0d8e18)
> >> > > > > > at
> >> > > > > > > > Python/ceval.c:4344
> >> > > > > > > > #84 call_function (oparg=<optimized out>,
> >> > > pp_stack=0x7fff86d8fa30) at
> >> > > > > > > > Python/ceval.c:4262
> >> > > > > > > > #85 PyEval_EvalFrameEx (f=<optimized out>,
> >> throwflag=<optimized
> >> > > out>)
> >> > > > > > at
> >> > > > > > > > Python/ceval.c:2838
> >> > > > > > > > #86 0x00007f2669427d68 in fast_function (nk=<optimized out>,
> >> > > > > > na=<optimized
> >> > > > > > > > out>, n=3, pp_stack=0x7fff86d8fb70, func=0x7f25fa0e1620) at
> >> > > > > > > > Python/ceval.c:4334
> >> > > > > > > > #87 call_function (oparg=<optimized out>,
> >> > > pp_stack=0x7fff86d8fb70) at
> >> > > > > > > > Python/ceval.c:4262
> >> > > > > > > > #88 PyEval_EvalFrameEx (f=<optimized out>,
> >> throwflag=<optimized
> >> > > out>)
> >> > > > > > at
> >> > > > > > > > Python/ceval.c:2838
> >> > > > > > > > #89 0x00007f2669427d68 in fast_function (nk=<optimized out>,
> >> > > > > > na=<optimized
> >> > > > > > > > out>, n=3, pp_stack=0x7fff86d8fcb0, func=0x7f25fa0fab70) at
> >> > > > > > > > Python/ceval.c:4334
> >> > > > > > > > #90 call_function (oparg=<optimized out>,
> >> > > pp_stack=0x7fff86d8fcb0) at
> >> > > > > > > > Python/ceval.c:4262
> >> > > > > > > > #91 PyEval_EvalFrameEx (f=<optimized out>,
> >> throwflag=<optimized
> >> > > out>)
> >> > > > > > at
> >> > > > > > > > Python/ceval.c:2838
> >> > > > > > > > #92 0x00007f2669427d68 in fast_function (nk=<optimized out>,
> >> > > > > > na=<optimized
> >> > > > > > > > out>, n=3, pp_stack=0x7fff86d8fdf0, func=0x7f25fe2c16a8) at
> >> > > > > > > > Python/ceval.c:4334
> >> > > > > > > > #93 call_function (oparg=<optimized out>,
> >> > > pp_stack=0x7fff86d8fdf0) at
> >> > > > > > > > Python/ceval.c:4262
> >> > > > > > > > #94 PyEval_EvalFrameEx (f=<optimized out>,
> >> throwflag=<optimized
> >> > > out>)
> >> > > > > > at
> >> > > > > > > > Python/ceval.c:2838
> >> > > > > > > > #95 0x00007f2669427d68 in fast_function (nk=<optimized out>,
> >> > > > > > na=<optimized
> >> > > > > > > > out>, n=1, pp_stack=0x7fff86d8ff30, func=0x7f25fe2c1620) at
> >> > > > > > > > Python/ceval.c:4334
> >> > > > > > > > #96 call_function (oparg=<optimized out>,
> >> > > pp_stack=0x7fff86d8ff30) at
> >> > > > > > > > Python/ceval.c:4262
> >> > > > > > > > #97 PyEval_EvalFrameEx (f=<optimized out>,
> >> throwflag=<optimized
> >> > > out>)
> >> > > > > > at
> >> > > > > > > > Python/ceval.c:2838
> >> > > > > > > > #98 0x00007f2669427d68 in fast_function (nk=<optimized out>,
> >> > > > > > na=<optimized
> >> > > > > > > > out>, n=4, pp_stack=0x7fff86d90070, func=0x7f25fa0fc048) at
> >> > > > > > > > Python/ceval.c:4334
> >> > > > > > > > #99 call_function (oparg=<optimized out>,
> >> > > pp_stack=0x7fff86d90070) at
> >> > > > > > > > Python/ceval.c:4262
> >> > > > > > > > #100 PyEval_EvalFrameEx (f=<optimized out>,
> >> throwflag=<optimized
> >> > > out>)
> >> > > > > > at
> >> > > > > > > > Python/ceval.c:2838
> >> > > > > > > > #101 0x00007f2669427d68 in fast_function (nk=<optimized
> >> out>,
> >> > > > > > na=<optimized
> >> > > > > > > > out>, n=1, pp_stack=0x7fff86d901b0, func=0x7f25fe2c19d8) at
> >> > > > > > > > Python/ceval.c:4334
> >> > > > > > > > #102 call_function (oparg=<optimized out>,
> >> > > pp_stack=0x7fff86d901b0) at
> >> > > > > > > > Python/ceval.c:4262
> >> > > > > > > > #103 PyEval_EvalFrameEx (f=<optimized out>,
> >> throwflag=<optimized
> >> > > out>)
> >> > > > > > at
> >> > > > > > > > Python/ceval.c:2838
> >> > > > > > > > #104 0x00007f2669428cec in PyEval_EvalCodeEx (_co=<optimized
> >> > > out>,
> >> > > > > > > > globals=<optimized out>, locals=<optimized out>,
> >> args=<optimized
> >> > > out>,
> >> > > > > > > > argcount=0, kws=0x0, kwcount=0, defs=0x0, defcount=0,
> >> > > > > > > >     kwdefs=0x0, closure=0x0) at Python/ceval.c:3588
> >> > > > > > > > #105 0x00007f2669428dbb in PyEval_EvalCode (co=<optimized
> >> out>,
> >> > > > > > > > globals=<optimized out>, locals=<optimized out>) at
> >> > > Python/ceval.c:775
> >> > > > > > > > #106 0x00007f266944f770 in run_mod (arena=0x1a62b70,
> >> > > > > > flags=0x7fff86d904e0,
> >> > > > > > > > locals=0x7f2669b7a448, globals=0x7f2669b7a448,
> >> > > filename=0x7f2669ab7b20,
> >> > > > > > > > mod=<optimized out>) at Python/pythonrun.c:2180
> >> > > > > > > > #107 PyRun_FileExFlags (fp=0x1ab3c20,
> >> filename_str=<optimized
> >> > > out>,
> >> > > > > > > > start=<optimized out>, globals=0x7f2669b7a448,
> >> > > locals=0x7f2669b7a448,
> >> > > > > > > > closeit=1, flags=0x7fff86d904e0) at Python/pythonrun.c:2133
> >> > > > > > > > #108 0x00007f26694504b9 in PyRun_SimpleFileExFlags
> >> (fp=0x1ab3c20,
> >> > > > > > > > filename=<optimized out>, closeit=1, flags=0x7fff86d904e0)
> >> at
> >> > > > > > > > Python/pythonrun.c:1606
> >> > > > > > > > #109 0x00007f2669468231 in run_file (p_cf=0x7fff86d904e0, filename=0x19f5830
> >> > > > > > > > L"/global/homes/r/regan/.local/bin/read_fast5_basecaller.py", fp=0x1ab3c20)
> >> > > > > > > > at Modules/main.c:319
> >> > > > > > > > #110 Py_Main (argc=<optimized out>, argv=<optimized out>) at
> >> > > > > > > > Modules/main.c:751
> >> > > > > > > > #111 0x0000000000400ae3 in main (argc=14,
> >> argv=0x7fff86d90658) at
> >> > > > > > > > ./Modules/python.c:69
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > On Sat, Jul 29, 2017 at 7:21 AM, Rohan Garg <rohg...@ccs.neu.edu> wrote:
> >> > > > > > > >
> >> > > > > > > > > I just realized that there's an open issue related to Python and
> >> > > > > > > > > Numpy.  See here: https://github.com/dmtcp/dmtcp/issues/599.
> >> > > > > > > > > Unfortunately, I haven't been able to spend enough time to get a
> >> > > > > > > > > fix out.
> >> > > > > > > > >
> >> > > > > > > > > Could you please verify if this isn't what's going on?  Even if it's
> >> > > > > > > > > not the exact same libraries, I think the bug, as reported on the
> >> > > > > > > > > Github issue, is more general and could manifest through different
> >> > > > > > > > > libraries, although with a similar backtrace signature.
> >> > > > > > > > >
> >> > > > > > > > > On Fri, Jul 28, 2017 at 09:53:40PM -0400, Rohan Garg wrote:
> >> > > > > > > > > > Hi Rob,
> >> > > > > > > > > >
> >> > > > > > > > > > Thanks for trying out DMTCP and reporting this issue.
> >> We'd be
> >> > > > > > happy to
> >> > > > > > > > > > help resolve this issue.
> >> > > > > > > > > >
> >> > > > > > > > > > I'll take a look at the strace, but probably the most useful thing
> >> > > > > > > > > > here would be to try to figure out the process that's hanging and
> >> > > > > > > > > > to look at its backtrace. The backtrace from the parent could also
> >> > > > > > > > > > be useful. You should be able to attach gdb to the interesting
> >> > > > > > > > > > processes and look at the backtrace.
> >> > > > > > > > > >
> >> > > > > > > > > >     $ gdb attach <pid>
> >> > > > > > > > > >     ...
> >> > > > > > > > > >     (gdb) thread apply all bt
> >> > > > > > > > > >
> >> > > > > > > > > > This can hopefully tell us the reason for the deadlock. Also, as a
> >> > > > > > > > > > quick test, could you please try running your program with the
> >> > > > > > > > > > following DMTCP options?
> >> > > > > > > > > >
> >> > > > > > > > > >     $ dmtcp_launch --disable-alloc-plugin --disable-dl-plugin <program-and-program-arguments>
> >> > > > > > > > > >
> >> > > > > > > > > > Also, I'd recommend working with the 2.4.x/2.5.x versions; we are
> >> > > > > > > > > > not maintaining the 1.x branch any longer.
> >> > > > > > > > > >
> >> > > > > > > > > > (I do have a NERSC account and have recently diagnosed some issues
> >> > > > > > > > > > with the CRAY toolchain/srun and DMTCP there. But it seems like
> >> > > > > > > > > > that's not an issue here.)
> >> > > > > > > > > >
> >> > > > > > > > > > Thanks,
> >> > > > > > > > > > Rohan
> >> > > > > > > > > >
> >> > > > > > > > > > On Fri, Jul 28, 2017 at 06:28:03PM -0700, Rob Egan wrote:
> >> > > > > > > > > > > Hello,
> >> > > > > > > > > > >
> >> > > > > > > > > > > I am trying to adopt dmtcp at our institution and ran into a
> >> > > > > > > > > > > problem with the first script that I tried.  The script seems to
> >> > > > > > > > > > > start fine and perform a few operations, but then stalls
> >> > > > > > > > > > > indefinitely, consuming 0 cpu, in what looks like a futex wait
> >> > > > > > > > > > > loop after the process was cloned.  It is a multi-threaded
> >> > > > > > > > > > > program, and the behavior is the same whether 1 or more threads
> >> > > > > > > > > > > are requested.
> >> > > > > > > > > > >
> >> > > > > > > > > > > This is a 3rd party python script, so unfortunately I can't share
> >> > > > > > > > > > > the code, but I uploaded the strace from a 1-threaded execution:
> >> > > > > > > > > > >
> >> > > > > > > > > > > dmtcp-stall-read_fast5_basecaller.py-strace.log.gz
> >> > > > > > > > > > > http://portal.nersc.gov/dna/RD/Adv-Seq/ONT/dmtcp-stall-read_fast5_basecaller.py-strace.log.gz
> >> > > > > > > > > > >
> >> > > > > > > > > > > This is how I executed dmtcp & strace & the script:
> >> > > > > > > > > > >
> >> > > > > > > > > > > dmtcp_launch  strace ~/.local/bin/read_fast5_basecaller.py -i uploaded/0/ -o fastq,fast5 -s basecalls6 -t 1 --flowcell FLO-MIN107 --kit SQK-LSK308
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > > > And here is the output from the dmtcp_coordinator (which I
> >> > > > > > > > > > > commanded to kill after a few minutes of inactivity):
> >> > > > > > > > > > >
> >> > > > > > > > > > > regan@gpint200:/global/projectb/scratch/regan/nanopore/runs/X0124-onebatch2$ dmtcp_coordinator
> >> > > > > > > > > > > dmtcp_coordinator starting...
> >> > > > > > > > > > >     Host: gpint200 (127.0.0.2)
> >> > > > > > > > > > >     Port: 7779
> >> > > > > > > > > > >     Checkpoint Interval: 18000
> >> > > > > > > > > > >     Exit on last client: 0
> >> > > > > > > > > > > Type '?' for help.
> >> > > > > > > > > > >
> >> > > > > > > > > > > [10273] NOTE at dmtcp_coordinator.cpp:1675 in
> >> > > > > > updateCheckpointInterval;
> >> > > > > > > > > > > REASON='CheckpointInterval updated (for this
> >> computation
> >> > > only)'
> >> > > > > > > > > > >      oldInterval = 18000
> >> > > > > > > > > > >      theCheckpointInterval = 18000
> >> > > > > > > > > > > [10273] NOTE at dmtcp_coordinator.cpp:1081 in
> >> onConnect;
> >> > > > > > REASON='worker
> >> > > > > > > > > > > connected'
> >> > > > > > > > > > >      hello_remote.from =
> >> 704891f6494a6304-10275-597be341
> >> > > > > > > > > > > [10273] NOTE at dmtcp_coordinator.cpp:869 in onData;
> >> > > > > > REASON='Updating
> >> > > > > > > > > > > process Information after exec()'
> >> > > > > > > > > > >      progname = strace
> >> > > > > > > > > > >      msg.from = 704891f6494a6304-40000-597be341
> >> > > > > > > > > > >      client->identity() =
> >> 704891f6494a6304-10275-597be341
> >> > > > > > > > > > > [10273] NOTE at dmtcp_coordinator.cpp:1081 in
> >> onConnect;
> >> > > > > > REASON='worker
> >> > > > > > > > > > > connected'
> >> > > > > > > > > > >      hello_remote.from =
> >> 704891f6494a6304-40000-597be341
> >> > > > > > > > > > > [10273] NOTE at dmtcp_coordinator.cpp:860 in onData;
> >> > > > > > REASON='Updating
> >> > > > > > > > > > > process Information after fork()'
> >> > > > > > > > > > >      client->hostname() = gpint200
> >> > > > > > > > > > >      client->progname() = strace_(forked)
> >> > > > > > > > > > >      msg.from = 704891f6494a6304-41000-597be341
> >> > > > > > > > > > >      client->identity() =
> >> 704891f6494a6304-40000-597be341
> >> > > > > > > > > > > [10273] NOTE at dmtcp_coordinator.cpp:869 in onData;
> >> > > > > > REASON='Updating
> >> > > > > > > > > > > process Information after exec()'
> >> > > > > > > > > > > [10273] NOTE at dmtcp_coordinator.cpp:1081 in
> >> onConnect;
> >> > > > > > REASON='worker
> >> > > > > > > > > > > connected'
> >> > > > > > > > > > >      hello_remote.from =
> >> 704891f6494a6304-40000-597be341
> >> > > > > > > > > > > [10273] NOTE at dmtcp_coordinator.cpp:860 in onData;
> >> > > > > > REASON='Updating
> >> > > > > > > > > > > process Information after fork()'
> >> > > > > > > > > > >      client->hostname() = gpint200
> >> > > > > > > > > > >      client->progname() = strace_(forked)
> >> > > > > > > > > > >      msg.from = 704891f6494a6304-41000-597be341
> >> > > > > > > > > > >      client->identity() =
> >> 704891f6494a6304-40000-597be341
> >> > > > > > > > > > > [10273] NOTE at dmtcp_coordinator.cpp:869 in onData;
> >> > > > > > REASON='Updating
> >> > > > > > > > > > > process Information after exec()'
> >> > > > > > > > > > >      progname = python3.4
> >> > > > > > > > > > >      msg.from = 704891f6494a6304-41000-597be341
> >> > > > > > > > > > >      client->identity() =
> >> 704891f6494a6304-41000-597be341
> >> > > > > > > > > > > [10273] NOTE at dmtcp_coordinator.cpp:1081 in
> >> onConnect;
> >> > > > > > REASON='worker
> >> > > > > > > > > > > connected'
> >> > > > > > > > > > >      hello_remote.from =
> >> 704891f6494a6304-41000-597be341
> >> > > > > > > > > > > k
> >> > > > > > > > > > > [10273] NOTE at dmtcp_coordinator.cpp:603 in
> >> > > handleUserCommand;
> >> > > > > > > > > > > REASON='Killing all connected Peers...'
> >> > > > > > > > > > > [10273] NOTE at dmtcp_coordinator.cpp:919 in
> >> onDisconnect;
> >> > > > > > > > > REASON='client
> >> > > > > > > > > > > disconnected'
> >> > > > > > > > > > >      client->identity() =
> >> 704891f6494a6304-40000-597be341
> >> > > > > > > > > > >      client->progname() = strace
> >> > > > > > > > > > > [10273] NOTE at dmtcp_coordinator.cpp:919 in
> >> onDisconnect;
> >> > > > > > > > > REASON='client
> >> > > > > > > > > > > disconnected'
> >> > > > > > > > > > >      client->identity() =
> >> 704891f6494a6304-41000-597be341
> >> > > > > > > > > > >      client->progname() = python3.4_(forked)
> >> > > > > > > > > > > [10273] NOTE at dmtcp_coordinator.cpp:919 in
> >> onDisconnect;
> >> > > > > > > > > REASON='client
> >> > > > > > > > > > > disconnected'
> >> > > > > > > > > > >      client->identity() =
> >> 704891f6494a6304-41000-597be341
> >> > > > > > > > > > >      client->progname() = python3.4
> >> > > > > > > > > > > l
> >> > > > > > > > > > > Client List:
> >> > > > > > > > > > > #, PROG[virtPID:realPID]@HOST, DMTCP-UNIQUEPID, STATE
> >> > > > > > > > > > > q
> >> > > > > > > > > > > [10273] NOTE at dmtcp_coordinator.cpp:590 in
> >> > > handleUserCommand;
> >> > > > > > > > > > > REASON='killing all connected peers and quitting ...'
> >> > > > > > > > > > > DMTCP coordinator exiting... (per request)
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > > > This is a process that on large data sets can take a few days on
> >> > > > > > > > > > > a 32 core node, so it is something that we would really like to be
> >> > > > > > > > > > > able to checkpoint on our cluster.
> >> > > > > > > > > > >
> >> > > > > > > > > > > Please let me know how I can help debug this with you.  I've tried
> >> > > > > > > > > > > versions 1.2.5 and 2.4.8 and 2.5, and they all have the same
> >> > > > > > > > > > > problem.
> >> > > > > > > > > > >
> >> > > > > > > > > > > Thanks,
> >> > > > > > > > > > > Rob Egan
> >> > > > > > > > > >
