Hello,

I am trying to adopt DMTCP at our institution and ran into a problem with
the first script that I tried.  The script seems to start fine and perform
a few operations, but then stalls indefinitely, consuming 0% CPU in what
looks like a futex wait loop after the process was cloned.  It is a
multi-threaded program, and the behavior is the same whether one or more
threads are requested.

This is a third-party Python script, so unfortunately I can't share the
code, but I uploaded the strace log from a single-threaded execution:

dmtcp-stall-read_fast5_basecaller.py-strace.log.gz
<http://portal.nersc.gov/dna/RD/Adv-Seq/ONT/dmtcp-stall-read_fast5_basecaller.py-strace.log.gz>

This is how I executed dmtcp_launch, strace, and the script:

dmtcp_launch strace ~/.local/bin/read_fast5_basecaller.py -i uploaded/0/ -o fastq,fast5 -s basecalls6 -t 1 --flowcell FLO-MIN107 --kit SQK-LSK308
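
For reference, the equivalent launch without the strace wrapper (in case a
cleaner reproducer is useful) would just be:

dmtcp_launch ~/.local/bin/read_fast5_basecaller.py -i uploaded/0/ \
  -o fastq,fast5 -s basecalls6 -t 1 --flowcell FLO-MIN107 --kit SQK-LSK308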


And here is the output from the dmtcp_coordinator (to which I issued the
kill command after a few minutes of inactivity):


regan@gpint200:/global/projectb/scratch/regan/nanopore/runs/X0124-onebatch2$ dmtcp_coordinator
dmtcp_coordinator starting...
    Host: gpint200 (127.0.0.2)
    Port: 7779
    Checkpoint Interval: 18000
    Exit on last client: 0
Type '?' for help.

[10273] NOTE at dmtcp_coordinator.cpp:1675 in updateCheckpointInterval;
REASON='CheckpointInterval updated (for this computation only)'
     oldInterval = 18000
     theCheckpointInterval = 18000
[10273] NOTE at dmtcp_coordinator.cpp:1081 in onConnect; REASON='worker
connected'
     hello_remote.from = 704891f6494a6304-10275-597be341
[10273] NOTE at dmtcp_coordinator.cpp:869 in onData; REASON='Updating
process Information after exec()'
     progname = strace
     msg.from = 704891f6494a6304-40000-597be341
     client->identity() = 704891f6494a6304-10275-597be341
[10273] NOTE at dmtcp_coordinator.cpp:1081 in onConnect; REASON='worker
connected'
     hello_remote.from = 704891f6494a6304-40000-597be341
[10273] NOTE at dmtcp_coordinator.cpp:860 in onData; REASON='Updating
process Information after fork()'
     client->hostname() = gpint200
     client->progname() = strace_(forked)
     msg.from = 704891f6494a6304-41000-597be341
     client->identity() = 704891f6494a6304-40000-597be341
[10273] NOTE at dmtcp_coordinator.cpp:869 in onData; REASON='Updating
process Information after exec()'
[10273] NOTE at dmtcp_coordinator.cpp:1081 in onConnect; REASON='worker
connected'
     hello_remote.from = 704891f6494a6304-40000-597be341
[10273] NOTE at dmtcp_coordinator.cpp:860 in onData; REASON='Updating
process Information after fork()'
     client->hostname() = gpint200
     client->progname() = strace_(forked)
     msg.from = 704891f6494a6304-41000-597be341
     client->identity() = 704891f6494a6304-40000-597be341
[10273] NOTE at dmtcp_coordinator.cpp:869 in onData; REASON='Updating
process Information after exec()'
     progname = python3.4
     msg.from = 704891f6494a6304-41000-597be341
     client->identity() = 704891f6494a6304-41000-597be341
[10273] NOTE at dmtcp_coordinator.cpp:1081 in onConnect; REASON='worker
connected'
     hello_remote.from = 704891f6494a6304-41000-597be341
k
[10273] NOTE at dmtcp_coordinator.cpp:603 in handleUserCommand;
REASON='Killing all connected Peers...'
[10273] NOTE at dmtcp_coordinator.cpp:919 in onDisconnect; REASON='client
disconnected'
     client->identity() = 704891f6494a6304-40000-597be341
     client->progname() = strace
[10273] NOTE at dmtcp_coordinator.cpp:919 in onDisconnect; REASON='client
disconnected'
     client->identity() = 704891f6494a6304-41000-597be341
     client->progname() = python3.4_(forked)
[10273] NOTE at dmtcp_coordinator.cpp:919 in onDisconnect; REASON='client
disconnected'
     client->identity() = 704891f6494a6304-41000-597be341
     client->progname() = python3.4
l
Client List:
#, PROG[virtPID:realPID]@HOST, DMTCP-UNIQUEPID, STATE
q
[10273] NOTE at dmtcp_coordinator.cpp:590 in handleUserCommand;
REASON='killing all connected peers and quitting ...'
DMTCP coordinator exiting... (per request)


On large data sets this process can take a few days on a 32-core node, so
it is something that we would really like to be able to checkpoint on our
cluster.

Please let me know how I can help debug this with you.  I've tried DMTCP
versions 1.2.5, 2.4.8, and 2.5, and they all show the same problem.
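
If a backtrace of the stalled threads would help, I can attach gdb to the
hung python process and capture one, along these lines (a sketch; the pgrep
pattern is just my guess at how to pick out the right PID):

gdb -batch -ex 'thread apply all bt' -p $(pgrep -f -n read_fast5_basecaller.py)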

Thanks,
Rob Egan