I just realized that there's an open issue related to Python and Numpy. See here: https://github.com/dmtcp/dmtcp/issues/599. Unfortunately, I haven't been able to spend enough time to get a fix out.
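To compare a backtrace signature against the one in issue #599, gdb can dump every thread's stack non-interactively. A minimal sketch (the `sleep` target below is only a stand-in; in the real case, point `pid` at the stalled python process, e.g. found via `pgrep`):

```shell
# Demo target: a throwaway background process.  For the real hang,
# set pid to the stalled python process instead.
sleep 60 &
pid=$!

# -batch runs the -ex commands and exits; "thread apply all bt" dumps a
# backtrace for every thread, which is what we want to compare against
# the signature reported in issue #599.
gdb -p "$pid" -batch -ex "thread apply all bt" \
    > "backtrace-$pid.log" 2>&1 || true

kill "$pid" 2>/dev/null
```

If ptrace is restricted on the machine, the same command run as root (or after relaxing `/proc/sys/kernel/yama/ptrace_scope`) should succeed.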
Could you please check whether this is what's going on? Even if the exact libraries differ, I think the bug reported on the GitHub issue is more general and could manifest through different libraries, with a similar backtrace signature.

On Fri, Jul 28, 2017 at 09:53:40PM -0400, Rohan Garg wrote:
> Hi Rob,
>
> Thanks for trying out DMTCP and reporting this issue. We'd be happy to
> help resolve it.
>
> I'll take a look at the strace, but probably the most useful thing here
> would be to figure out which process is hanging and to look at its
> backtrace. The backtrace from the parent could also be useful. You
> should be able to attach gdb to the interesting processes and look at
> their backtraces:
>
> $ gdb -p <pid>
> ...
> (gdb) thread apply all bt
>
> This can hopefully tell us the reason for the deadlock. Also, as a quick
> test, could you please try running your program with the following DMTCP
> options?
>
> $ dmtcp_launch --disable-alloc-plugin --disable-dl-plugin <program-and-program-arguments>
>
> Also, I'd recommend working with the 2.4.x/2.5.x versions; we are no
> longer maintaining the 1.x branch.
>
> (I do have a NERSC account and have recently diagnosed some issues with
> the Cray toolchain/srun and DMTCP there. But it seems that's not an
> issue here.)
>
> Thanks,
> Rohan
>
> On Fri, Jul 28, 2017 at 06:28:03PM -0700, Rob Egan wrote:
> > Hello,
> >
> > I am trying to adopt DMTCP at our institution and ran into a problem
> > with the first script that I tried. The script seems to start fine and
> > performs a few operations, but then stalls indefinitely, consuming 0%
> > CPU in what looks like a futex wait loop after the process was cloned.
> > It is a multi-threaded program, and the behavior is the same whether
> > one or more threads are requested.
> >
> > This is a third-party Python script, so unfortunately I can't share
> > the code, but I uploaded the strace from a single-threaded execution:
> >
> > http://portal.nersc.gov/dna/RD/Adv-Seq/ONT/dmtcp-stall-read_fast5_basecaller.py-strace.log.gz
> >
> > This is how I executed dmtcp, strace, and the script:
> >
> > dmtcp_launch strace ~/.local/bin/read_fast5_basecaller.py -i uploaded/0/ \
> >     -o fastq,fast5 -s basecalls6 -t 1 --flowcell FLO-MIN107 --kit SQK-LSK308
> >
> > And here is the output from the dmtcp_coordinator (which I commanded
> > to kill after a few minutes of inactivity):
> >
> > regan@gpint200:/global/projectb/scratch/regan/nanopore/runs/X0124-onebatch2$ dmtcp_coordinator
> > dmtcp_coordinator starting...
> >     Host: gpint200 (127.0.0.2)
> >     Port: 7779
> >     Checkpoint Interval: 18000
> >     Exit on last client: 0
> > Type '?' for help.
> >
> > [10273] NOTE at dmtcp_coordinator.cpp:1675 in updateCheckpointInterval;
> > REASON='CheckpointInterval updated (for this computation only)'
> >      oldInterval = 18000
> >      theCheckpointInterval = 18000
> > [10273] NOTE at dmtcp_coordinator.cpp:1081 in onConnect; REASON='worker connected'
> >      hello_remote.from = 704891f6494a6304-10275-597be341
> > [10273] NOTE at dmtcp_coordinator.cpp:869 in onData; REASON='Updating process Information after exec()'
> >      progname = strace
> >      msg.from = 704891f6494a6304-40000-597be341
> >      client->identity() = 704891f6494a6304-10275-597be341
> > [10273] NOTE at dmtcp_coordinator.cpp:1081 in onConnect; REASON='worker connected'
> >      hello_remote.from = 704891f6494a6304-40000-597be341
> > [10273] NOTE at dmtcp_coordinator.cpp:860 in onData; REASON='Updating process Information after fork()'
> >      client->hostname() = gpint200
> >      client->progname() = strace_(forked)
> >      msg.from = 704891f6494a6304-41000-597be341
> >      client->identity() = 704891f6494a6304-40000-597be341
> > [10273] NOTE at dmtcp_coordinator.cpp:869 in onData; REASON='Updating process Information after exec()'
> > [10273] NOTE at dmtcp_coordinator.cpp:1081 in onConnect; REASON='worker connected'
> >      hello_remote.from = 704891f6494a6304-40000-597be341
> > [10273] NOTE at dmtcp_coordinator.cpp:860 in onData; REASON='Updating process Information after fork()'
> >      client->hostname() = gpint200
> >      client->progname() = strace_(forked)
> >      msg.from = 704891f6494a6304-41000-597be341
> >      client->identity() = 704891f6494a6304-40000-597be341
> > [10273] NOTE at dmtcp_coordinator.cpp:869 in onData; REASON='Updating process Information after exec()'
> >      progname = python3.4
> >      msg.from = 704891f6494a6304-41000-597be341
> >      client->identity() = 704891f6494a6304-41000-597be341
> > [10273] NOTE at dmtcp_coordinator.cpp:1081 in onConnect; REASON='worker connected'
> >      hello_remote.from = 704891f6494a6304-41000-597be341
> > k
> > [10273] NOTE at
dmtcp_coordinator.cpp:603 in handleUserCommand;
> > REASON='Killing all connected Peers...'
> > [10273] NOTE at dmtcp_coordinator.cpp:919 in onDisconnect; REASON='client disconnected'
> >      client->identity() = 704891f6494a6304-40000-597be341
> >      client->progname() = strace
> > [10273] NOTE at dmtcp_coordinator.cpp:919 in onDisconnect; REASON='client disconnected'
> >      client->identity() = 704891f6494a6304-41000-597be341
> >      client->progname() = python3.4_(forked)
> > [10273] NOTE at dmtcp_coordinator.cpp:919 in onDisconnect; REASON='client disconnected'
> >      client->identity() = 704891f6494a6304-41000-597be341
> >      client->progname() = python3.4
> > l
> > Client List:
> > #, PROG[virtPID:realPID]@HOST, DMTCP-UNIQUEPID, STATE
> > q
> > [10273] NOTE at dmtcp_coordinator.cpp:590 in handleUserCommand;
> > REASON='killing all connected peers and quitting ...'
> > DMTCP coordinator exiting... (per request)
> >
> > On large data sets this process can take a few days on a 32-core node,
> > so it is something we would really like to be able to checkpoint on
> > our cluster.
> >
> > Please let me know how I can help debug this with you. I've tried
> > versions 1.2.5, 2.4.8, and 2.5, and they all have the same problem.
> >
> > Thanks,
> > Rob Egan
> >
> > ------------------------------------------------------------------------------
> > Check out the vibrant tech community on one of the world's most
> > engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> > _______________________________________________
> > Dmtcp-forum mailing list
> > Dmtcp-forum@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
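For reference, the quick test suggested in the thread can be scripted as a one-liner. Adding `-f`/`-tt` to strace is my own suggestion, not from the thread: it makes strace follow the cloned child, which is where the futex wait appears. A sketch, guarded so it degrades gracefully when DMTCP isn't on the PATH; the basecaller path and flags are taken verbatim from Rob's report:

```shell
# Quick test: relaunch with the alloc and dl plugins disabled, and have
# strace follow forks/clones (-f) with timestamps (-tt) so the cloned
# child's futex wait is captured in dmtcp-stall.strace as well.
if command -v dmtcp_launch >/dev/null 2>&1; then
  dmtcp_launch --disable-alloc-plugin --disable-dl-plugin \
    strace -f -tt -o dmtcp-stall.strace \
    ~/.local/bin/read_fast5_basecaller.py -i uploaded/0/ \
    -o fastq,fast5 -s basecalls6 -t 1 \
    --flowcell FLO-MIN107 --kit SQK-LSK308
else
  echo "dmtcp_launch not on PATH; install DMTCP 2.4.x/2.5.x first"
fi
```

If the program runs to completion with both plugins disabled, that narrows the deadlock down to one of those two plugins.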