Hello,
I am trying to adopt DMTCP at our institution and ran into a problem with
the first script that I tried. The script starts fine and performs a few
operations, but then stalls indefinitely, consuming 0% CPU in what looks
like a futex wait loop after the process was cloned. It is a
multi-threaded program, and the behavior is the same whether one or more
threads are requested.
This is a third-party Python script, so unfortunately I can't share the
code, but I uploaded the strace log from a single-threaded execution:
dmtcp-stall-read_fast5_basecaller.py-strace.log.gz
http://portal.nersc.gov/dna/RD/Adv-Seq/ONT/dmtcp-stall-read_fast5_basecaller.py-strace.log.gz
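Since I can't post the real code, here is a minimal multi-threaded stand-in
just to illustrate the shape of the program (my own toy example, not the
basecaller; I have not verified whether it reproduces the stall under DMTCP):

```python
# Toy stand-in (NOT the actual basecaller script): a minimal
# multi-threaded Python program that spawns worker threads and joins
# them. The real script stalls in a futex wait around the join/clone
# stage when run under dmtcp_launch.
import threading

def worker(n, results, idx):
    # Trivial CPU-bound work so each thread actually runs.
    total = 0
    for i in range(n):
        total += i
    results[idx] = total

def run(num_threads=1, n=100000):
    results = [None] * num_threads
    threads = [threading.Thread(target=worker, args=(n, results, i))
               for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        # The real script's stall looks like it happens while the main
        # thread is blocked in a futex wait, as during a join like this.
        t.join()
    return results

# usage: run(4) returns the per-thread sums
```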
This is how I launched the script under DMTCP and strace:
dmtcp_launch strace ~/.local/bin/read_fast5_basecaller.py -i uploaded/0/
-o fastq,fast5 -s basecalls6 -t 1 --flowcell FLO-MIN107 --kit SQK-LSK308
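In case it saves anyone time: the stall shows up in the uploaded log as
repeated futex waits, and a trivial filter like this is how I'd count them
(plain POSIX shell plus grep, nothing DMTCP-specific):

```shell
# Count futex calls in an strace log; zcat -f handles both plain and
# gzipped logs. This is only a generic grep over the trace.
summarize_futex() {
    zcat -f -- "$1" | grep -c 'futex('
}
# usage:
# summarize_futex dmtcp-stall-read_fast5_basecaller.py-strace.log.gz
```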
And here is the output from the dmtcp_coordinator (to which I issued the
'k' (kill) command after a few minutes of inactivity):
regan@gpint200:/global/projectb/scratch/regan/nanopore/runs/X0124-onebatch2$
dmtcp_coordinator
dmtcp_coordinator starting...
Host: gpint200 (127.0.0.2)
Port: 7779
Checkpoint Interval: 18000
Exit on last client: 0
Type '?' for help.
[10273] NOTE at dmtcp_coordinator.cpp:1675 in updateCheckpointInterval;
REASON='CheckpointInterval updated (for this computation only)'
oldInterval = 18000
theCheckpointInterval = 18000
[10273] NOTE at dmtcp_coordinator.cpp:1081 in onConnect; REASON='worker
connected'
hello_remote.from = 704891f6494a6304-10275-597be341
[10273] NOTE at dmtcp_coordinator.cpp:869 in onData; REASON='Updating
process Information after exec()'
progname = strace
msg.from = 704891f6494a6304-40000-597be341
client->identity() = 704891f6494a6304-10275-597be341
[10273] NOTE at dmtcp_coordinator.cpp:1081 in onConnect; REASON='worker
connected'
hello_remote.from = 704891f6494a6304-40000-597be341
[10273] NOTE at dmtcp_coordinator.cpp:860 in onData; REASON='Updating
process Information after fork()'
client->hostname() = gpint200
client->progname() = strace_(forked)
msg.from = 704891f6494a6304-41000-597be341
client->identity() = 704891f6494a6304-40000-597be341
[10273] NOTE at dmtcp_coordinator.cpp:869 in onData; REASON='Updating
process Information after exec()'
[10273] NOTE at dmtcp_coordinator.cpp:1081 in onConnect; REASON='worker
connected'
hello_remote.from = 704891f6494a6304-40000-597be341
[10273] NOTE at dmtcp_coordinator.cpp:860 in onData; REASON='Updating
process Information after fork()'
client->hostname() = gpint200
client->progname() = strace_(forked)
msg.from = 704891f6494a6304-41000-597be341
client->identity() = 704891f6494a6304-40000-597be341
[10273] NOTE at dmtcp_coordinator.cpp:869 in onData; REASON='Updating
process Information after exec()'
progname = python3.4
msg.from = 704891f6494a6304-41000-597be341
client->identity() = 704891f6494a6304-41000-597be341
[10273] NOTE at dmtcp_coordinator.cpp:1081 in onConnect; REASON='worker
connected'
hello_remote.from = 704891f6494a6304-41000-597be341
k
[10273] NOTE at dmtcp_coordinator.cpp:603 in handleUserCommand;
REASON='Killing all connected Peers...'
[10273] NOTE at dmtcp_coordinator.cpp:919 in onDisconnect; REASON='client
disconnected'
client->identity() = 704891f6494a6304-40000-597be341
client->progname() = strace
[10273] NOTE at dmtcp_coordinator.cpp:919 in onDisconnect; REASON='client
disconnected'
client->identity() = 704891f6494a6304-41000-597be341
client->progname() = python3.4_(forked)
[10273] NOTE at dmtcp_coordinator.cpp:919 in onDisconnect; REASON='client
disconnected'
client->identity() = 704891f6494a6304-41000-597be341
client->progname() = python3.4
l
Client List:
#, PROG[virtPID:realPID]@HOST, DMTCP-UNIQUEPID, STATE
q
[10273] NOTE at dmtcp_coordinator.cpp:590 in handleUserCommand;
REASON='killing all connected peers and quitting ...'
DMTCP coordinator exiting... (per request)
On large data sets this process can take a few days on a 32-core node, so
it is something we would really like to be able to checkpoint on our
cluster.
Please let me know how I can help debug this with you. I've tried versions
1.2.5, 2.4.8, and 2.5, and they all show the same problem.
Thanks,
Rob Egan
_______________________________________________
Dmtcp-forum mailing list
Dmtcp-forum@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum