Hi Rob, For some reason, the DMTCP forum on SourceForge is acting weird; it rejected my last message. So I'm not sure if you received my last e-mail.
We are looking into this actively. Jiajun has also opened an issue on Github (see: https://github.com/dmtcp/dmtcp/issues/612), which we believe is the reason for the bug that you are seeing. Hopefully, we should have a fix soon.

-Rohan

On Fri, Aug 04, 2017 at 02:23:43PM -0400, Rohan Garg wrote:
> Hi Rob,
>
> I was discussing this with Jiajun, and it seems like we might be able to write a simple test program to simulate this bug. This will definitely require a fix in DMTCP. Hopefully, we can provide a patch to you soon.
>
> I'll try to work on it this weekend. We'll see if a screen sharing session is necessary.
>
> Thanks,
> Rohan
>
> On Tue, Aug 01, 2017 at 11:00:50PM -0700, Rob Egan wrote:
> > Thanks Jiajun & Rohan!
> >
> > So in the trace from before, threads 2 through 41 are all identical, so almost all the threads are in that nanosleep. Since this is third-party software, my ability to change the environment and code is quite limited, and Python 3.4 is the only version that I was able to get working in this environment.
> >
> > Jiajun's explanation sounds like the issue to me. What would be the next steps to address it? Would we need to patch DMTCP? OpenBLAS?
> >
> > I'd be happy to arrange a screen sharing / Zoom / Hangouts session with one or more of you.
> >
> > -Rob
> >
> > On Tue, Aug 1, 2017 at 9:37 AM, Jiajun Cao <jia...@ccs.neu.edu> wrote:
> > > Hi Rob,
> > >
> > > I know what's going on here after browsing some blas source code. As Rohan mentioned, thread 42 is behaving normally. What looks more interesting is the calling stack of thread 1, the main thread, especially the bottom 5 frames. It seems that the blas library registers some handler at fork, using pthread_atfork():
> > >
> > > https://github.com/xianyi/OpenBLAS/blob/e70a6b92bf9f69f113e198e3ded558ed0675b038/driver/others/memory.c
> > >
> > > The handler tries to join a bunch of other blas threads, in this case, including thread 41.
> > >
> > > The problem is, DMTCP has its own internal synchronization semantics. When a thread exits, it tries to grab a lock to prevent a checkpoint from happening: it is an internal critical section. On the other hand, when the application calls fork(), it also grabs the lock before fork() returns: another internal critical section. That's why we see a deadlock here:
> > >
> > > thread 1: lock acquired, and waiting for thread 41 to finish
> > > thread 41: trying to acquire the lock before it exits
> > >
> > > Best,
> > > Jiajun
> > >
> > > On Mon, Jul 31, 2017 at 10:15:15PM -0700, Rohan Garg wrote:
> > > > Hi Rob,
> > > >
> > > > Thread 42 is DMTCP's checkpoint thread. DMTCP injects it into every user process that's launched using DMTCP. It seems to be simply waiting for a message from the coordinator, which is the normal state of things. It's not stuck, nor is it the cause of the deadlock here.
> > > >
> > > > The deadlock is definitely because of one or more user threads. It seems like the program also forks and execs some child processes. We have seen some corner cases around that, but it's difficult to say what it is with the information below.
> > > >
> > > > The fact that even `--disable-all-plugins` didn't work is quite interesting.
> > > >
> > > > I have a few other suggestions:
> > > >
> > > > - Is it possible to share perhaps the top 20-25 frames from each of the threads? (You can also send them to us/me directly at: dm...@ccs.neu.edu/rohg...@ccs.neu.edu)
> > > >
> > > >   Also, after you save the backtraces from all the threads, detach from gdb, wait for a few seconds, re-attach, and save a second set of backtraces. It'll help us figure out which threads are really stuck.
> > > >
> > > > - Is it possible to run with other Python versions (not 3.4) as a quick test?
> > > >
> > > > - What Numpy/OpenBLAS versions are you using? I'll try to run a few experiments locally.
> > > >
> > > > - Is it possible to cut your Python script down to the bare minimum, to isolate the issue? Perhaps start with just a few imports and then add code incrementally. But if it turns out to be more work than it's worth, then we should skip this suggestion.
> > > >
> > > > - Finally, if it's possible, we could arrange for a screen sharing session. I think it'll be more efficient for debugging this interactively.
> > > >
> > > > -Rohan
> > > >
> > > > On Mon, Jul 31, 2017 at 03:30:55PM -0700, Rob Egan wrote:
> > > > > Hi Rohan,
> > > > >
> > > > > Thanks for the quick reply. So using version 2.4.8 with --disable-alloc-plugin and --disable-dl-plugin, the behavior is the same. I also tried --disable-all-plugins and it still did not work, so I think this may be different from the numpy bug you identified, though as you will see, the stack trace does show one of the threads in the numpy / openblas library.
> > > > >
> > > > > Though we are running at NERSC, I am not running this on the Cray systems (yet) and not through Slurm (yet).
> > > > >
> > > > > When I attach to the running process, unfortunately there are a ton of threads and a deep Python stack. Apparently, even though I only requested 1 thread, the Python library allocates a ton of them. I tried to remove duplicates manually for you. It looks like threads 42 and 1 are the interesting ones; 2-41 are all the same. It looks to me like dmtcp_coordinator is interfering with the blas library's pthread_join calls.
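One way to capture the two backtrace snapshots Rohan asks for above, without an interactive gdb session, is a small batch-mode wrapper. The sketch below is hypothetical (it is not part of the original thread's tooling); it assumes gdb is on the PATH and that you are allowed to ptrace the target, and the script and output file names are invented for illustration:

    #!/usr/bin/env python3
    # grab_backtraces.py -- hypothetical helper: dump the top 25 frames of
    # every thread of a running process twice, a few seconds apart, so the
    # threads that never move (the truly stuck ones) stand out.
    # Usage: python3 grab_backtraces.py <pid>
    import subprocess
    import sys
    import time

    if len(sys.argv) != 2:
        sys.exit("usage: grab_backtraces.py <pid>")
    pid = sys.argv[1]

    for round_no in (1, 2):
        out_name = "backtraces-%s-round%d.txt" % (pid, round_no)
        with open(out_name, "w") as out:
            # gdb batch mode: attach, dump the backtraces, detach on exit.
            subprocess.call(
                ["gdb", "-batch", "-p", pid,
                 "-ex", "thread apply all bt 25"],
                stdout=out, stderr=subprocess.STDOUT)
        if round_no == 1:
            time.sleep(10)  # let the process run (or stay stuck) for a bit

Diffing the two output files then shows at a glance which threads have not advanced between snapshots.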
> > > > > > > > > > > > > > > > > > > > > > > > > Thread 42 (Thread 0x7f266738e700 (LWP 14136)): > > > > > #0 0x00007f26690ed14d in read () from /lib/libpthread.so.0 > > > > > #1 0x00007f2669804224 in _real_read (fd=<optimized out>, > > > buf=<optimized > > > > > out>, count=<optimized out>) at syscallsreal.c:504 > > > > > #2 0x00007f26697eae80 in dmtcp::Util::readAll (fd=821, > > > buf=0x7f266738bc00, > > > > > count=136) at util_misc.cpp:279 > > > > > #3 0x00007f26697c303b in operator>><dmtcp::DmtcpMessage> (t=..., > > > > > this=0x7f2669c33408) at ../jalib/jsocket.h:109 > > > > > #4 dmtcp::CoordinatorAPI::recvMsgFromCoordinator > > > (this=0x7f2669c33408, > > > > > msg=0x7f266738bc00, extraData=0x0) at coordinatorapi.cpp:422 > > > > > #5 0x00007f26697bb9fc in dmtcp::DmtcpWorker::waitForCoordinatorMsg > > > > > (msgStr=..., type=dmtcp::DMT_DO_SUSPEND) at dmtcpworker.cpp:469 > > > > > #6 0x00007f26697bc58b in dmtcp::DmtcpWorker::waitForStage1Suspend () > > > at > > > > > dmtcpworker.cpp:513 > > > > > #7 0x00007f26697ccd1a in dmtcp::callbackSleepBetweenCheckpoint > > > > > (sec=<optimized out>) at mtcpinterface.cpp:61 > > > > > #8 0x00007f26697d85e9 in checkpointhread (dummy=<optimized out>) at > > > > > threadlist.cpp:359 > > > > > #9 0x00007f26697cfb9e in pthread_start (arg=<optimized out>) at > > > > > threadwrappers.cpp:159 > > > > > #10 0x00007f26690e58ca in start_thread () from /lib/libpthread.so.0 > > > > > #11 0x00007f26697cf9c9 in clone_start (arg=0x7f2669c15008) at > > > > > threadwrappers.cpp:68 > > > > > #12 0x00007f26687c3abd in clone () from /lib/libc.so.6 > > > > > #13 0x0000000000000000 in ?? () > > > > > > > > > > Thread 41 (Thread 0x7f2660f99700 (LWP 14137)): > > > > > #0 0x00007f26690ed86d in nanosleep () from /lib/libpthread.so.0 > > > > > #1 0x00007f26697bf599 in dmtcp::ThreadSync::wrapperExecutionLockLock > > > () at > > > > > threadsync.cpp:392 > > > > > #2 0x00007f26697cfba6 in pthread_start (arg=<optimized out>) at > > > > > threadwrappers.cpp:161 > > > > > #3 0x00007f26690e58ca in start_thread () from /lib/libpthread.so.0 > > > > > #4 0x00007f26697cf9c9 in clone_start (arg=0x7f2669c24008) at > > > > > threadwrappers.cpp:68 > > > > > #5 0x00007f26687c3abd in clone () from /lib/libc.so.6 > > > > > #6 0x0000000000000000 in ?? () > > > > > > > > > > Thread 40 (Thread 0x7f2660798700 (LWP 14138)): > > > > > ... 
> > > > > > > > > > Thread 1 (Thread 0x7f2669c26740 (LWP 14131)): > > > > > #0 0x00007f26690ece69 in __lll_timedwait_tid () from > > > /lib/libpthread.so.0 > > > > > #1 0x00007f26690e6ebf in pthread_timedjoin_np () from > > > /lib/libpthread.so.0 > > > > > #2 0x00007f26697d0048 in pthread_join (thread=139802812454656, > > > retval=0x0) > > > > > at threadwrappers.cpp:274 > > > > > #3 0x00007f26615cd2d7 in blas_thread_shutdown_ () from > > > > > /global/homes/r/regan/.local/lib/python3.4/site-packages/ > > > numpy/core/../.libs/ > > > > > libopenblasp-r0-39a31c03.2.18.so > > > > > #4 0x00007f26687936d5 in fork () from /lib/libc.so.6 > > > > > #5 0x00007f26697c9724 in fork () at execwrappers.cpp:186 > > > > > #6 0x00007f2665b9ed7f in subprocess_fork_exec (self=<optimized out>, > > > > > args=<optimized out>) at > > > > > /global/common/genepool/usg/languages/python/build/Python- > > > 3.4.3/Modules/_posixsubprocess.c:649 > > > > > #7 0x00007f2669427225 in call_function (oparg=<optimized out>, > > > > > pp_stack=0x7fff86d8d690) at Python/c > > > > > eval.c:4237 > > > > > #8 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) > > > at > > > > > Python/ceval.c:2838 > > > > > #9 0x00007f2669428cec in PyEval_EvalCodeEx (_co=<optimized out>, > > > > > globals=<optimized out>, locals=<optimized out>, args=<optimized out>, > > > > > argcount=19, kws=0x7f25f4014ad8, kwcount=0, defs=0x0, > > > > > defcount=0, kwdefs=0x0, closure=0x0) at Python/ceval.c:3588 > > > > > #10 0x00007f2669427ca4 in fast_function (nk=<optimized out>, > > > na=<optimized > > > > > out>, n=<optimized out>, pp_stack=0x7fff86d8d880, func=0x7f2669a60ea0) > > > at > > > > > Python/ceval.c:4344 > > > > > #11 call_function (oparg=<optimized out>, pp_stack=0x7fff86d8d880) at > > > > > Python/ceval.c:4262 > > > > > #12 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) > > > at > > > > > Python/ceval.c:2838 > > > > > #13 0x00007f2669428cec in PyEval_EvalCodeEx (_co=<optimized out>, > > > > > globals=<optimized out>, locals=<optimized out>, args=<optimized out>, > > > > > argcount=2, kws=0x7f25fa0dda70, kwcount=4, > > > > > defs=0x7f2669a55110, defcount=16, kwdefs=0x0, closure=0x0) at > > > > > Python/ceval.c:3588 > > > > > #14 0x00007f266938a2bc in function_call (func=0x7f2669a60840, > > > > > arg=0x7f25f9f2c148, kw=0x7f25f9f46208) at Objects/funcobject.c:632 > > > > > #15 0x00007f266935e8be in PyObject_Call (func=0x7f2669a60840, > > > > > arg=<optimized out>, kw=<optimized out>) at Objects/abstract.c:2040 > > > > > #16 0x00007f2669375354 in method_call (func=0x7f2669a60840, > > > > > arg=0x7f25f9f2c148, kw=0x7f25f9f46208) at Objects/classobject.c:347 > > > > > #17 0x00007f266935e8be in PyObject_Call (func=0x7f2663ff4148, > > > > > arg=<optimized out>, kw=<optimized out>) at Objects/abstract.c:2040 > > > > > #18 0x00007f26693c15f7 in slot_tp_init (self=0x7f25f9f45198, > > > > > args=0x7f25f9f45128, kwds=0x7f25f9f46208) at Objects/typeobject.c:6177 > > > > > #19 0x00007f26693bd17f in type_call (type=<optimized out>, > > > > > args=0x7f25f9f45128, kwds=0x7f25f9f46208) at Objects/typeobject.c:898 > > > > > #20 0x00007f266935e8be in PyObject_Call (func=0x1b8c0a8, > > > > > arg=<optimized > > > > > out>, kw=<optimized out>) at Objects/abstract.c:2040 > > > > > #21 0x00007f2669425a06 in do_call (nk=<optimized out>, na=<optimized > > > out>, > > > > > pp_stack=0x7fff86d8dc80, func=0x1b8c0a8) at Python/ceval.c:4466 > > > > > #22 call_function (oparg=<optimized out>, pp_stack=0x7fff86d8dc80) at > > > > > 
Python/ceval.c:4264 > > > > > #23 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) > > > at > > > > > Python/ceval.c:2838 > > > > > #24 0x00007f2669427d68 in fast_function (nk=<optimized out>, > > > na=<optimized > > > > > out>, n=1, pp_stack=0x7fff86d8ddc0, func=0x7f25f9e96bf8) at > > > > > Python/ceval.c:4334 > > > > > #25 call_function (oparg=<optimized out>, pp_stack=0x7fff86d8ddc0) at > > > > > Python/ceval.c:4262 > > > > > #26 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) > > > at > > > > > Python/ceval.c:2838 > > > > > #27 0x00007f2669427d68 in fast_function (nk=<optimized out>, > > > na=<optimized > > > > > out>, n=1, pp_stack=0x7fff86d8df00, func=0x7f25f9e96c80) at > > > > > Python/ceval.c:4334 > > > > > #28 call_function (oparg=<optimized out>, pp_stack=0x7fff86d8df00) at > > > > > Python/ceval.c:4262 > > > > > #29 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) > > > at > > > > > Python/ceval.c:2838 > > > > > #30 0x00007f2669428cec in PyEval_EvalCodeEx (_co=<optimized out>, > > > > > globals=<optimized out>, locals=<optimized out>, args=<optimized out>, > > > > > argcount=0, kws=0x0, kwcount=0, defs=0x0, defcount=0, > > > > > kwdefs=0x0, closure=0x0) at Python/ceval.c:3588 > > > > > #31 0x00007f2669428dbb in PyEval_EvalCode (co=<optimized out>, > > > > > globals=<optimized out>, locals=<optimized out>) at Python/ceval.c:775 > > > > > #32 0x00007f266941d3d8 in builtin_exec (self=<optimized out>, > > > > > args=<optimized out>) at Python/bltinmodule.c:882 > > > > > #33 0x00007f2669426247 in ext_do_call (nk=0, na=0, flags=<optimized > > > out>, > > > > > pp_stack=0x7fff86d8e1f0, func=0x7f2669c05788) at Python/ceval.c:4558 > > > > > #34 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) > > > at > > > > > Python/ceval.c:2878 > > > > > #35 0x00007f2669428cec in PyEval_EvalCodeEx (_co=<optimized out>, > > > > > globals=<optimized out>, locals=<optimized out>, args=<optimized out>, > > > > > argcount=3, kws=0x7f2663d13938, kwcount=0, defs=0x0, > > > > > defcount=0, kwdefs=0x0, closure=0x0) at Python/ceval.c:3588 > > > > > #36 0x00007f2669427ca4 in fast_function (nk=<optimized out>, > > > na=<optimized > > > > > out>, n=<optimized out>, pp_stack=0x7fff86d8e3e0, func=0x7f2669bd3b70) > > > at > > > > > Python/ceval.c:4344 > > > > > #37 call_function (oparg=<optimized out>, pp_stack=0x7fff86d8e3e0) at > > > > > Python/ceval.c:4262 > > > > > #38 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) > > > at > > > > > Python/ceval.c:2838 > > > > > #39 0x00007f2669427d68 in fast_function (nk=<optimized out>, > > > na=<optimized > > > > > out>, n=2, pp_stack=0x7fff86d8e520, func=0x7f2669b5d158) at > > > > > Python/ceval.c:4334 > > > > > #40 call_function (oparg=<optimized out>, pp_stack=0x7fff86d8e520) at > > > > > Python/ceval.c:4262 > > > > > #41 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) > > > at > > > > > Python/ceval.c:2838 > > > > > #42 0x00007f2669427d68 in fast_function (nk=<optimized out>, > > > na=<optimized > > > > > out>, n=2, pp_stack=0x7fff86d8e660, func=0x7f2669b56ea0) at > > > > > Python/ceval.c:4334 > > > > > #43 call_function (oparg=<optimized out>, pp_stack=0x7fff86d8e660) at > > > > > Python/ceval.c:4262 > > > > > #44 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) > > > at > > > > > Python/ceval.c:2838 > > > > > #45 0x00007f2669427d68 in fast_function (nk=<optimized out>, > > > na=<optimized > > > > > out>, n=1, pp_stack=0x7fff86d8e7a0, 
func=0x7f2669b590d0) at > > > > > Python/ceval.c:4334 > > > > > #46 call_function (oparg=<optimized out>, pp_stack=0x7fff86d8e7a0) at > > > > > Python/ceval.c:4262 > > > > > #47 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) > > > at > > > > > Python/ceval.c:2838 > > > > > #48 0x00007f2669427d68 in fast_function (nk=<optimized out>, > > > na=<optimized > > > > > out>, n=2, pp_stack=0x7fff86d8e8e0, func=0x7f2669b61620) at > > > > > Python/ceval.c:4334 > > > > > #49 call_function (oparg=<optimized out>, pp_stack=0x7fff86d8e8e0) at > > > > > Python/ceval.c:4262 > > > > > #50 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) > > > at > > > > > Python/ceval.c:2838 > > > > > #51 0x00007f2669428cec in PyEval_EvalCodeEx (_co=<optimized out>, > > > > > globals=<optimized out>, locals=<optimized out>, args=<optimized out>, > > > > > argcount=2, kws=0x0, kwcount=0, defs=0x0, defcount=0, > > > > > kwdefs=0x0, closure=0x0) at Python/ceval.c:3588 > > > > > #52 0x00007f266938a1b5 in function_call (func=0x7f2669b616a8, > > > > > arg=0x7f25f9f2cf48, kw=0x0) at Objects/funcobject.c:632 > > > > > #53 0x00007f266935e8be in PyObject_Call (func=0x7f2669b616a8, > > > > > arg=<optimized out>, kw=<optimized out>) at Objects/abstract.c:2040 > > > > > #54 0x00007f266935f2cd in _PyObject_CallMethodIdObjArgs > > > > > (callable=0x7f2669b616a8, name=<optimized out>) at > > > Objects/abstract.c:2310 > > > > > #55 0x00007f26694427bc in PyImport_ImportModuleLevelObject > > > > > (name=0x7f25fb28aa78, given_globals=<optimized out>, locals=<optimized > > > > > out>, given_fromlist=<optimized out>, level=0) at Python/import.c:1636 > > > > > #56 0x00007f266941dcaf in builtin___import__ (self=<optimized out>, > > > > > args=<optimized out>, kwds=<optimized out>) at > > > > > Python/bltinmodule.c:210 > > > > > #57 0x00007f266935e8be in PyObject_Call (func=0x7f2669c053c8, > > > > > arg=<optimized out>, kw=<optimized out>) at Objects/abstract.c:2040 > > > > > #58 0x00007f2669420227 in PyEval_CallObjectWithKeywords > > > > > (func=0x7f2669c053c8, arg=0x7f25fa0dfc50, kw=<optimized out>) at > > > > > Python/ceval.c:4114 > > > > > #59 0x00007f2669421907 in PyEval_EvalFrameEx (f=<optimized out>, > > > > > throwflag=<optimized out>) at Python/ceval.c:2478 > > > > > #60 0x00007f2669428cec in PyEval_EvalCodeEx (_co=<optimized out>, > > > > > globals=<optimized out>, locals=<optimized out>, args=<optimized out>, > > > > > argcount=2, kws=0x7f25f9f3e3d0, kwcount=2, > > > > > defs=0x7f25fa5ed360, defcount=2, kwdefs=0x0, closure=0x0) at > > > > > Python/ceval.c:3588 > > > > > #61 0x00007f2669427ca4 in fast_function (nk=<optimized out>, > > > na=<optimized > > > > > out>, n=<optimized out>, pp_stack=0x7fff86d8eff0, func=0x7f25fa5efe18) > > > at > > > > > Python/ceval.c:4344 > > > > > #62 call_function (oparg=<optimized out>, pp_stack=0x7fff86d8eff0) at > > > > > Python/ceval.c:4262 > > > > > #63 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) > > > at > > > > > Python/ceval.c:2838 > > > > > #64 0x00007f2669428cec in PyEval_EvalCodeEx (_co=<optimized out>, > > > > > globals=<optimized out>, locals=<optimized out>, args=<optimized out>, > > > > > argcount=3, kws=0x7f2669bd5060, kwcount=0, defs=0x0, > > > > > defcount=0, kwdefs=0x0, closure=0x0) at Python/ceval.c:3588 > > > > > #65 0x00007f266938a2bc in function_call (func=0x7f25fa5efd08, > > > > > arg=0x7f25fa0f4ab0, kw=0x7f25f9f2cdc8) at Objects/funcobject.c:632 > > > > > #66 0x00007f25fc7ae500 in __Pyx_PyObject_Call (kw=0x7f25f9f2cdc8, > > 
> > > arg=<optimized out>, func=<optimized out>) at > > > > > /global/scratch2/sd/gpinterp/pip-build-5fy4tlgm/h5py/h5py/_ > > > objects.c:7420 > > > > > #67 __pyx_pf_4h5py_8_objects_9with_phil_wrapper > > > > > (__pyx_v_kwds=0x7f25f9f2cdc8, __pyx_v_args=0x7f25fa0f4ab0, > > > > > __pyx_self=<optimized out>) > > > > > at > > > > > /global/scratch2/sd/gpinterp/pip-build-5fy4tlgm/h5py/h5py/_ > > > objects.c:2429 > > > > > #68 __pyx_pw_4h5py_8_objects_9with_phil_1wrapper > > > (__pyx_self=<optimized > > > > > out>, __pyx_args=0x7f25fa0f4ab0, __pyx_kwds=<optimized out>) > > > > > at > > > > > /global/scratch2/sd/gpinterp/pip-build-5fy4tlgm/h5py/h5py/_ > > > objects.c:2345 > > > > > #69 0x00007f266935e8be in PyObject_Call (func=0x7f25fa5ec498, > > > > > arg=<optimized out>, kw=<optimized out>) at Objects/abstract.c:2040 > > > > > #70 0x00007f2669375354 in method_call (func=0x7f25fa5ec498, > > > > > arg=0x7f25fa0f4ab0, kw=0x0) at Objects/classobject.c:347 > > > > > #71 0x00007f266935e8be in PyObject_Call (func=0x7f2663d3a048, > > > > > arg=<optimized out>, kw=<optimized out>) at Objects/abstract.c:2040 > > > > > #72 0x00007f26693be3ec in call_method (o=<optimized out>, > > > > > nameid=0x7f266974d0d0 <PyId___setitem__>, format=0x7f26694e7609 > > > "(OO)") at > > > > > Objects/typeobject.c:1429 > > > > > #73 0x00007f26693be541 in slot_mp_ass_subscript (self=<optimized out>, > > > > > key=<optimized out>, value=<optimized out>) at > > > Objects/typeobject.c:5751 > > > > > #74 0x00007f2669424d0d in PyEval_EvalFrameEx (f=<optimized out>, > > > > > throwflag=<optimized out>) at Python/ceval.c:1831 > > > > > #75 0x00007f2669428cec in PyEval_EvalCodeEx (_co=<optimized out>, > > > > > globals=<optimized out>, locals=<optimized out>, args=<optimized out>, > > > > > argcount=3, kws=0x7f25f9f27e68, kwcount=0, > > > > > defs=0x7f25fb259300, defcount=1, kwdefs=0x0, closure=0x0) at > > > > > Python/ceval.c:3588 > > > > > #76 0x00007f2669427ca4 in fast_function (nk=<optimized out>, > > > na=<optimized > > > > > out>, n=<optimized out>, pp_stack=0x7fff86d8f700, func=0x7f25fa0db730) > > > at > > > > > Python/ceval.c:4344 > > > > > #77 call_function (oparg=<optimized out>, pp_stack=0x7fff86d8f700) at > > > > > Python/ceval.c:4262 > > > > > #78 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) > > > at > > > > > Python/ceval.c:2838 > > > > > #79 0x00007f2669427d68 in fast_function (nk=<optimized out>, > > > na=<optimized > > > > > out>, n=3, pp_stack=0x7fff86d8f840, func=0x7f25fa0db6a8) at > > > > > Python/ceval.c:4334 > > > > > #80 call_function (oparg=<optimized out>, pp_stack=0x7fff86d8f840) at > > > > > Python/ceval.c:4262 > > > > > #81 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) > > > at > > > > > Python/ceval.c:2838 > > > > > #82 0x00007f2669428cec in PyEval_EvalCodeEx (_co=<optimized out>, > > > > > globals=<optimized out>, locals=<optimized out>, args=<optimized out>, > > > > > argcount=5, kws=0x7f25f424f280, kwcount=0, > > > > > defs=0x7f25fe2d1f08, defcount=1, kwdefs=0x0, closure=0x0) at > > > > > Python/ceval.c:3588 > > > > > #83 0x00007f2669427ca4 in fast_function (nk=<optimized out>, > > > na=<optimized > > > > > out>, n=<optimized out>, pp_stack=0x7fff86d8fa30, func=0x7f25fa0d8e18) > > > at > > > > > Python/ceval.c:4344 > > > > > #84 call_function (oparg=<optimized out>, pp_stack=0x7fff86d8fa30) at > > > > > Python/ceval.c:4262 > > > > > #85 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) > > > at > > > > > Python/ceval.c:2838 > > > > > #86 
0x00007f2669427d68 in fast_function (nk=<optimized out>, > > > na=<optimized > > > > > out>, n=3, pp_stack=0x7fff86d8fb70, func=0x7f25fa0e1620) at > > > > > Python/ceval.c:4334 > > > > > #87 call_function (oparg=<optimized out>, pp_stack=0x7fff86d8fb70) at > > > > > Python/ceval.c:4262 > > > > > #88 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) > > > at > > > > > Python/ceval.c:2838 > > > > > #89 0x00007f2669427d68 in fast_function (nk=<optimized out>, > > > na=<optimized > > > > > out>, n=3, pp_stack=0x7fff86d8fcb0, func=0x7f25fa0fab70) at > > > > > Python/ceval.c:4334 > > > > > #90 call_function (oparg=<optimized out>, pp_stack=0x7fff86d8fcb0) at > > > > > Python/ceval.c:4262 > > > > > #91 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) > > > at > > > > > Python/ceval.c:2838 > > > > > #92 0x00007f2669427d68 in fast_function (nk=<optimized out>, > > > na=<optimized > > > > > out>, n=3, pp_stack=0x7fff86d8fdf0, func=0x7f25fe2c16a8) at > > > > > Python/ceval.c:4334 > > > > > #93 call_function (oparg=<optimized out>, pp_stack=0x7fff86d8fdf0) at > > > > > Python/ceval.c:4262 > > > > > #94 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) > > > at > > > > > Python/ceval.c:2838 > > > > > #95 0x00007f2669427d68 in fast_function (nk=<optimized out>, > > > na=<optimized > > > > > out>, n=1, pp_stack=0x7fff86d8ff30, func=0x7f25fe2c1620) at > > > > > Python/ceval.c:4334 > > > > > #96 call_function (oparg=<optimized out>, pp_stack=0x7fff86d8ff30) at > > > > > Python/ceval.c:4262 > > > > > #97 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) > > > at > > > > > Python/ceval.c:2838 > > > > > #98 0x00007f2669427d68 in fast_function (nk=<optimized out>, > > > na=<optimized > > > > > out>, n=4, pp_stack=0x7fff86d90070, func=0x7f25fa0fc048) at > > > > > Python/ceval.c:4334 > > > > > #99 call_function (oparg=<optimized out>, pp_stack=0x7fff86d90070) at > > > > > Python/ceval.c:4262 > > > > > #100 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) > > > at > > > > > Python/ceval.c:2838 > > > > > #101 0x00007f2669427d68 in fast_function (nk=<optimized out>, > > > na=<optimized > > > > > out>, n=1, pp_stack=0x7fff86d901b0, func=0x7f25fe2c19d8) at > > > > > Python/ceval.c:4334 > > > > > #102 call_function (oparg=<optimized out>, pp_stack=0x7fff86d901b0) at > > > > > Python/ceval.c:4262 > > > > > #103 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) > > > at > > > > > Python/ceval.c:2838 > > > > > #104 0x00007f2669428cec in PyEval_EvalCodeEx (_co=<optimized out>, > > > > > globals=<optimized out>, locals=<optimized out>, args=<optimized out>, > > > > > argcount=0, kws=0x0, kwcount=0, defs=0x0, defcount=0, > > > > > kwdefs=0x0, closure=0x0) at Python/ceval.c:3588 > > > > > #105 0x00007f2669428dbb in PyEval_EvalCode (co=<optimized out>, > > > > > globals=<optimized out>, locals=<optimized out>) at Python/ceval.c:775 > > > > > #106 0x00007f266944f770 in run_mod (arena=0x1a62b70, > > > flags=0x7fff86d904e0, > > > > > locals=0x7f2669b7a448, globals=0x7f2669b7a448, > > > > > filename=0x7f2669ab7b20, > > > > > mod=<optimized out>) at Python/pythonrun.c:2180 > > > > > #107 PyRun_FileExFlags (fp=0x1ab3c20, filename_str=<optimized out>, > > > > > start=<optimized out>, globals=0x7f2669b7a448, locals=0x7f2669b7a448, > > > > > closeit=1, flags=0x7fff86d904e0) at Python/pythonrun.c:2133 > > > > > #108 0x00007f26694504b9 in PyRun_SimpleFileExFlags (fp=0x1ab3c20, > > > > > filename=<optimized out>, closeit=1, 
flags=0x7fff86d904e0) at Python/pythonrun.c:1606
> > > > > #109 0x00007f2669468231 in run_file (p_cf=0x7fff86d904e0, filename=0x19f5830 L"/global/homes/r/regan/.local/bin/read_fast5_basecaller.py", fp=0x1ab3c20) at Modules/main.c:319
> > > > > #110 Py_Main (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:751
> > > > > #111 0x0000000000400ae3 in main (argc=14, argv=0x7fff86d90658) at ./Modules/python.c:69
> > > > >
> > > > > On Sat, Jul 29, 2017 at 7:21 AM, Rohan Garg <rohg...@ccs.neu.edu> wrote:
> > > > > > I just realized that there's an open issue related to Python and Numpy. See here: https://github.com/dmtcp/dmtcp/issues/599. Unfortunately, I haven't been able to spend enough time to get a fix out.
> > > > > >
> > > > > > Could you please verify whether this is what's going on? Even if it's not the exact same libraries, I think the bug, as reported on the Github issue, is more general and could manifest through different libraries, although with a similar backtrace signature.
> > > > > >
> > > > > > On Fri, Jul 28, 2017 at 09:53:40PM -0400, Rohan Garg wrote:
> > > > > > > Hi Rob,
> > > > > > >
> > > > > > > Thanks for trying out DMTCP and reporting this issue. We'd be happy to help resolve it.
> > > > > > >
> > > > > > > I'll take a look at the strace, but probably the most useful thing here would be to figure out which process is hanging and to look at its backtrace. The backtrace from the parent could also be useful. You should be able to attach gdb to the interesting processes and look at the backtrace:
> > > > > > >
> > > > > > > $ gdb attach <pid>
> > > > > > > ...
> > > > > > > (gdb) thread apply all bt
> > > > > > >
> > > > > > > This can hopefully tell us the reason for the deadlock. Also, as a quick test, could you please try running your program with the following DMTCP options?
> > > > > > >
> > > > > > > $ dmtcp_launch --disable-alloc-plugin --disable-dl-plugin <program-and-program-arguments>
> > > > > > >
> > > > > > > Also, I'd recommend working with the 2.4.x/2.5.x versions; we are not maintaining the 1.x branch any longer.
> > > > > > >
> > > > > > > (I do have a NERSC account and have recently diagnosed some issues with the CRAY toolchain/srun and DMTCP there. But it seems like that's not an issue here.)
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Rohan
> > > > > > >
> > > > > > > On Fri, Jul 28, 2017 at 06:28:03PM -0700, Rob Egan wrote:
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > I am trying to adopt DMTCP at our institution and ran into a problem with the first script that I tried. The script seems to start fine and perform a few operations, but then it stalls indefinitely, consuming 0 CPU in what looks like a futex wait loop after the process was cloned. It is a multi-threaded program, and the behavior is the same whether 1 or more threads are requested.
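As diagnosed above in this thread, the hang comes down to a BLAS-backed Python process calling fork() under DMTCP. The sketch below is a hypothetical stand-in for the affected third-party script, assuming numpy is linked against a threaded OpenBLAS build; the file name repro.py and the matrix size are invented for illustration:

    # repro.py -- hypothetical minimal reproducer, not the original script.
    # Assumes numpy is linked against a threaded OpenBLAS build
    # (check with: python3 -c "import numpy; numpy.show_config()").
    # Run under DMTCP with:  dmtcp_launch python3 repro.py
    import subprocess
    import numpy

    # Loading OpenBLAS registers its pthread_atfork() handler; doing a bit
    # of linear algebra makes sure its worker threads actually exist.
    a = numpy.random.rand(512, 512)
    b = numpy.dot(a, a)

    # subprocess goes through fork(); OpenBLAS's fork handler then tries to
    # join its worker threads while DMTCP's wrapper-execution lock is held,
    # which is the interaction Jiajun describes above.
    subprocess.check_call(["/bin/true"])
    print("fork survived, result shape:", b.shape)

If a stand-in like this hangs under dmtcp_launch but completes without it, that points to the same fork-handler interaction.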
> > > > > > > > This is a 3rd-party Python script, so unfortunately I can't share the code, but I uploaded the strace from a 1-threaded execution:
> > > > > > > >
> > > > > > > > dmtcp-stall-read_fast5_basecaller.py-strace.log.gz
> > > > > > > > http://portal.nersc.gov/dna/RD/Adv-Seq/ONT/dmtcp-stall-read_fast5_basecaller.py-strace.log.gz
> > > > > > > >
> > > > > > > > This is how I executed dmtcp & strace & the script:
> > > > > > > >
> > > > > > > > dmtcp_launch strace ~/.local/bin/read_fast5_basecaller.py -i uploaded/0/ -o fastq,fast5 -s basecalls6 -t 1 --flowcell FLO-MIN107 --kit SQK-LSK308
> > > > > > > >
> > > > > > > > And here is the output from the dmtcp_coordinator (which I commanded to kill after a few minutes of inactivity):
> > > > > > > >
> > > > > > > > regan@gpint200:/global/projectb/scratch/regan/nanopore/runs/X0124-onebatch2$ dmtcp_coordinator
> > > > > > > > dmtcp_coordinator starting...
> > > > > > > > Host: gpint200 (127.0.0.2)
> > > > > > > > Port: 7779
> > > > > > > > Checkpoint Interval: 18000
> > > > > > > > Exit on last client: 0
> > > > > > > > Type '?' for help.
> > > > > > > >
> > > > > > > > [10273] NOTE at dmtcp_coordinator.cpp:1675 in updateCheckpointInterval; REASON='CheckpointInterval updated (for this computation only)'
> > > > > > > > oldInterval = 18000
> > > > > > > > theCheckpointInterval = 18000
> > > > > > > > [10273] NOTE at dmtcp_coordinator.cpp:1081 in onConnect; REASON='worker connected'
> > > > > > > > hello_remote.from = 704891f6494a6304-10275-597be341
> > > > > > > > [10273] NOTE at dmtcp_coordinator.cpp:869 in onData; REASON='Updating process Information after exec()'
> > > > > > > > progname = strace
> > > > > > > > msg.from = 704891f6494a6304-40000-597be341
> > > > > > > > client->identity() = 704891f6494a6304-10275-597be341
> > > > > > > > [10273] NOTE at dmtcp_coordinator.cpp:1081 in onConnect; REASON='worker connected'
> > > > > > > > hello_remote.from = 704891f6494a6304-40000-597be341
> > > > > > > > [10273] NOTE at dmtcp_coordinator.cpp:860 in onData; REASON='Updating process Information after fork()'
> > > > > > > > client->hostname() = gpint200
> > > > > > > > client->progname() = strace_(forked)
> > > > > > > > msg.from = 704891f6494a6304-41000-597be341
> > > > > > > > client->identity() = 704891f6494a6304-40000-597be341
> > > > > > > > [10273] NOTE at dmtcp_coordinator.cpp:869 in onData; REASON='Updating process Information after exec()'
> > > > > > > > [10273] NOTE at dmtcp_coordinator.cpp:1081 in onConnect; REASON='worker connected'
> > > > > > > > hello_remote.from = 704891f6494a6304-40000-597be341
> > > > > > > > [10273] NOTE at dmtcp_coordinator.cpp:860 in onData; REASON='Updating process Information after fork()'
> > > > > > > > client->hostname() = gpint200
> > > > > > > > client->progname() = strace_(forked)
> > > > > > > > msg.from = 704891f6494a6304-41000-597be341
> > > > > > > > client->identity() = 704891f6494a6304-40000-597be341
> > > > > > > > [10273] NOTE at dmtcp_coordinator.cpp:869 in onData; REASON='Updating process Information after exec()'
> > > > > > > > progname = python3.4
> > > > > > > > msg.from = 704891f6494a6304-41000-597be341
> > > > > > > > client->identity() = 704891f6494a6304-41000-597be341
> > > > > > > > [10273] NOTE at dmtcp_coordinator.cpp:1081 in onConnect; REASON='worker connected'
> > > > > > > > hello_remote.from = 704891f6494a6304-41000-597be341
> > > > > > > > k
> > > > > > > > [10273] NOTE at dmtcp_coordinator.cpp:603 in handleUserCommand; REASON='Killing all connected Peers...'
> > > > > > > > [10273] NOTE at dmtcp_coordinator.cpp:919 in onDisconnect; REASON='client disconnected'
> > > > > > > > client->identity() = 704891f6494a6304-40000-597be341
> > > > > > > > client->progname() = strace
> > > > > > > > [10273] NOTE at dmtcp_coordinator.cpp:919 in onDisconnect; REASON='client disconnected'
> > > > > > > > client->identity() = 704891f6494a6304-41000-597be341
> > > > > > > > client->progname() = python3.4_(forked)
> > > > > > > > [10273] NOTE at dmtcp_coordinator.cpp:919 in onDisconnect; REASON='client disconnected'
> > > > > > > > client->identity() = 704891f6494a6304-41000-597be341
> > > > > > > > client->progname() = python3.4
> > > > > > > > l
> > > > > > > > Client List:
> > > > > > > > #, PROG[virtPID:realPID]@HOST, DMTCP-UNIQUEPID, STATE
> > > > > > > > q
> > > > > > > > [10273] NOTE at dmtcp_coordinator.cpp:590 in handleUserCommand; REASON='killing all connected peers and quitting ...'
> > > > > > > > DMTCP coordinator exiting... (per request)
> > > > > > > >
> > > > > > > > This is a process that on large data sets can take a few days on a 32-core node, so it is something that we would really like to be able to checkpoint on our cluster.
> > > > > > > >
> > > > > > > > Please let me know how I can help debug this with you. I've tried versions 1.2.5, 2.4.8, and 2.5, and they all have the same problem.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Rob Egan