
I am trying to adopt DMTCP at our institution and ran into a problem with
the first script that I tried.  The script seems to start fine and performs
a few operations, but then stalls indefinitely, consuming 0% CPU, in what
looks like a futex wait loop after the process was cloned.  It is a
multi-threaded program, and the behavior is the same whether one thread or
several are requested.
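
A minimal sketch of one way to check which threads are blocked without
a full strace, assuming a Linux /proc filesystem.  This is not part of
DMTCP; the helper names parse_state and thread_states are hypothetical,
and wchan may be unreadable or uninformative without sufficient privileges.

```python
import os


def parse_state(status_text):
    """Return the value of the 'State:' line from /proc status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            return line.split(":", 1)[1].strip()
    return None


def thread_states(pid):
    """Return a list of (tid, state, wchan) for every thread of pid (Linux only)."""
    task_dir = "/proc/%d/task" % pid
    results = []
    if not os.path.isdir(task_dir):
        return results  # no such process, or /proc is not available
    for tid in sorted(os.listdir(task_dir), key=int):
        try:
            with open(os.path.join(task_dir, tid, "status")) as fh:
                state = parse_state(fh.read())
        except OSError:
            continue  # the thread exited while we were scanning
        try:
            with open(os.path.join(task_dir, tid, "wchan")) as fh:
                wchan = fh.read().strip() or "-"
        except OSError:
            wchan = "?"  # assumption: unreadable without enough privileges
        results.append((int(tid), state, wchan))
    return results
```

Calling thread_states(pid) on the stalled python3.4 process and finding
every thread in state "S (sleeping)" with a wchan that names a futex
function would be consistent with the futex wait seen in the strace.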

This is a third-party Python script, so unfortunately I can't share the code,
but I uploaded the strace from a 1-threaded execution:


This is how I ran the script under dmtcp_launch and strace:

dmtcp_launch strace ~/.local/bin/read_fast5_basecaller.py -i uploaded/0/ \
    -o fastq,fast5 -s basecalls6 -t 1 --flowcell FLO-MIN107 --kit SQK-LSK308

And here is the output from dmtcp_coordinator (I issued the kill command
after a few minutes of inactivity):

dmtcp_coordinator starting...
    Host: gpint200 (
    Port: 7779
    Checkpoint Interval: 18000
    Exit on last client: 0
Type '?' for help.

[10273] NOTE at dmtcp_coordinator.cpp:1675 in updateCheckpointInterval;
REASON='CheckpointInterval updated (for this computation only)'
     oldInterval = 18000
     theCheckpointInterval = 18000
[10273] NOTE at dmtcp_coordinator.cpp:1081 in onConnect; REASON='worker
     hello_remote.from = 704891f6494a6304-10275-597be341
[10273] NOTE at dmtcp_coordinator.cpp:869 in onData; REASON='Updating
process Information after exec()'
     progname = strace
     msg.from = 704891f6494a6304-40000-597be341
     client->identity() = 704891f6494a6304-10275-597be341
[10273] NOTE at dmtcp_coordinator.cpp:1081 in onConnect; REASON='worker
     hello_remote.from = 704891f6494a6304-40000-597be341
[10273] NOTE at dmtcp_coordinator.cpp:860 in onData; REASON='Updating
process Information after fork()'
     client->hostname() = gpint200
     client->progname() = strace_(forked)
     msg.from = 704891f6494a6304-41000-597be341
     client->identity() = 704891f6494a6304-40000-597be341
[10273] NOTE at dmtcp_coordinator.cpp:869 in onData; REASON='Updating
process Information after exec()'
[10273] NOTE at dmtcp_coordinator.cpp:1081 in onConnect; REASON='worker
     hello_remote.from = 704891f6494a6304-40000-597be341
[10273] NOTE at dmtcp_coordinator.cpp:860 in onData; REASON='Updating
process Information after fork()'
     client->hostname() = gpint200
     client->progname() = strace_(forked)
     msg.from = 704891f6494a6304-41000-597be341
     client->identity() = 704891f6494a6304-40000-597be341
[10273] NOTE at dmtcp_coordinator.cpp:869 in onData; REASON='Updating
process Information after exec()'
     progname = python3.4
     msg.from = 704891f6494a6304-41000-597be341
     client->identity() = 704891f6494a6304-41000-597be341
[10273] NOTE at dmtcp_coordinator.cpp:1081 in onConnect; REASON='worker
     hello_remote.from = 704891f6494a6304-41000-597be341
[10273] NOTE at dmtcp_coordinator.cpp:603 in handleUserCommand;
REASON='Killing all connected Peers...'
[10273] NOTE at dmtcp_coordinator.cpp:919 in onDisconnect; REASON='client
     client->identity() = 704891f6494a6304-40000-597be341
     client->progname() = strace
[10273] NOTE at dmtcp_coordinator.cpp:919 in onDisconnect; REASON='client
     client->identity() = 704891f6494a6304-41000-597be341
     client->progname() = python3.4_(forked)
[10273] NOTE at dmtcp_coordinator.cpp:919 in onDisconnect; REASON='client
     client->identity() = 704891f6494a6304-41000-597be341
     client->progname() = python3.4
Client List:
[10273] NOTE at dmtcp_coordinator.cpp:590 in handleUserCommand;
REASON='killing all connected peers and quitting ...'
DMTCP coordinator exiting... (per request)

On large data sets this process can take a few days on a 32-core node, so
we would really like to be able to checkpoint it on our cluster.

Please let me know how I can help debug this with you.  I've tried versions
1.2.5, 2.4.8, and 2.5, and they all show the same problem.

Rob Egan