I just realized that there's an open issue related to Python and Numpy. See here: https://github.com/dmtcp/dmtcp/issues/599. Unfortunately, I haven't been able to spend enough time to get a fix out.
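To compare a backtrace signature against the one in issue #599, gdb can dump every thread's stack non-interactively. A minimal sketch (the `sleep` target below is only a stand-in; in the real case, point `pid` at the stalled python process, e.g. found via `pgrep`):

```shell
# Demo target: a throwaway background process.  For the real hang,
# set pid to the stalled python process instead.
sleep 60 &
pid=$!

# -batch runs the -ex commands and exits; "thread apply all bt" dumps a
# backtrace for every thread, which is what we want to compare against
# the signature reported in issue #599.
gdb -p "$pid" -batch -ex "thread apply all bt" \
    > "backtrace-$pid.log" 2>&1 || true

kill "$pid" 2>/dev/null
```

If ptrace is restricted on the machine, the same command run as root (or after relaxing `/proc/sys/kernel/yama/ptrace_scope`) should succeed.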
Could you please check whether this is what's going on? Even if the exact libraries differ, I think the bug reported on the GitHub issue is more general and could manifest through different libraries, with a similar backtrace signature.

On Fri, Jul 28, 2017 at 09:53:40PM -0400, Rohan Garg wrote:
> Hi Rob,
>
> Thanks for trying out DMTCP and reporting this issue. We'd be happy to
> help resolve it.
>
> I'll take a look at the strace, but probably the most useful thing here
> would be to figure out which process is hanging and to look at its
> backtrace. The backtrace from the parent could also be useful. You
> should be able to attach gdb to the interesting processes and look at
> their backtraces:
>
> $ gdb -p <pid>
> ...
> (gdb) thread apply all bt
>
> This can hopefully tell us the reason for the deadlock. Also, as a quick
> test, could you please try running your program with the following DMTCP
> options?
>
> $ dmtcp_launch --disable-alloc-plugin --disable-dl-plugin <program-and-program-arguments>
>
> Also, I'd recommend working with the 2.4.x/2.5.x versions; we are no
> longer maintaining the 1.x branch.
>
> (I do have a NERSC account and have recently diagnosed some issues with
> the Cray toolchain/srun and DMTCP there. But it seems that's not an
> issue here.)
>
> Thanks,
> Rohan
>
> On Fri, Jul 28, 2017 at 06:28:03PM -0700, Rob Egan wrote:
> > Hello,
> >
> > I am trying to adopt DMTCP at our institution and ran into a problem
> > with the first script that I tried. The script seems to start fine and
> > performs a few operations, but then stalls indefinitely, consuming 0%
> > CPU in what looks like a futex wait loop after the process was cloned.
> > It is a multi-threaded program, and the behavior is the same whether
> > one or more threads are requested.
> >
> > This is a third-party Python script, so unfortunately I can't share
> > the code, but I uploaded the strace from a single-threaded execution:
> >
> > http://portal.nersc.gov/dna/RD/Adv-Seq/ONT/dmtcp-stall-read_fast5_basecaller.py-strace.log.gz
> >
> > This is how I executed dmtcp, strace, and the script:
> >
> > dmtcp_launch strace ~/.local/bin/read_fast5_basecaller.py -i uploaded/0/ \
> >     -o fastq,fast5 -s basecalls6 -t 1 --flowcell FLO-MIN107 --kit SQK-LSK308
> >
> > And here is the output from the dmtcp_coordinator (which I commanded
> > to kill after a few minutes of inactivity):
> >
> > regan@gpint200:/global/projectb/scratch/regan/nanopore/runs/X0124-onebatch2$ dmtcp_coordinator
> > dmtcp_coordinator starting...
> >     Host: gpint200 (127.0.0.2)
> >     Port: 7779
> >     Checkpoint Interval: 18000
> >     Exit on last client: 0
> > Type '?' for help.
> >
> > [10273] NOTE at dmtcp_coordinator.cpp:1675 in updateCheckpointInterval;
> > REASON='CheckpointInterval updated (for this computation only)'
> >      oldInterval = 18000
> >      theCheckpointInterval = 18000
> > [10273] NOTE at dmtcp_coordinator.cpp:1081 in onConnect; REASON='worker connected'
> >      hello_remote.from = 704891f6494a6304-10275-597be341
> > [10273] NOTE at dmtcp_coordinator.cpp:869 in onData; REASON='Updating process Information after exec()'
> >      progname = strace
> >      msg.from = 704891f6494a6304-40000-597be341
> >      client->identity() = 704891f6494a6304-10275-597be341
> > [10273] NOTE at dmtcp_coordinator.cpp:1081 in onConnect; REASON='worker connected'
> >      hello_remote.from = 704891f6494a6304-40000-597be341
> > [10273] NOTE at dmtcp_coordinator.cpp:860 in onData; REASON='Updating process Information after fork()'
> >      client->hostname() = gpint200
> >      client->progname() = strace_(forked)
> >      msg.from = 704891f6494a6304-41000-597be341
> >      client->identity() = 704891f6494a6304-40000-597be341
> > [10273] NOTE at dmtcp_coordinator.cpp:869 in onData; REASON='Updating process Information after exec()'
> > [10273] NOTE at dmtcp_coordinator.cpp:1081 in onConnect; REASON='worker connected'
> >      hello_remote.from = 704891f6494a6304-40000-597be341
> > [10273] NOTE at dmtcp_coordinator.cpp:860 in onData; REASON='Updating process Information after fork()'
> >      client->hostname() = gpint200
> >      client->progname() = strace_(forked)
> >      msg.from = 704891f6494a6304-41000-597be341
> >      client->identity() = 704891f6494a6304-40000-597be341
> > [10273] NOTE at dmtcp_coordinator.cpp:869 in onData; REASON='Updating process Information after exec()'
> >      progname = python3.4
> >      msg.from = 704891f6494a6304-41000-597be341
> >      client->identity() = 704891f6494a6304-41000-597be341
> > [10273] NOTE at dmtcp_coordinator.cpp:1081 in onConnect; REASON='worker connected'
> >      hello_remote.from = 704891f6494a6304-41000-597be341
> > k
> > [10273] NOTE at
dmtcp_coordinator.cpp:603 in handleUserCommand;
> > REASON='Killing all connected Peers...'
> > [10273] NOTE at dmtcp_coordinator.cpp:919 in onDisconnect; REASON='client disconnected'
> >      client->identity() = 704891f6494a6304-40000-597be341
> >      client->progname() = strace
> > [10273] NOTE at dmtcp_coordinator.cpp:919 in onDisconnect; REASON='client disconnected'
> >      client->identity() = 704891f6494a6304-41000-597be341
> >      client->progname() = python3.4_(forked)
> > [10273] NOTE at dmtcp_coordinator.cpp:919 in onDisconnect; REASON='client disconnected'
> >      client->identity() = 704891f6494a6304-41000-597be341
> >      client->progname() = python3.4
> > l
> > Client List:
> > #, PROG[virtPID:realPID]@HOST, DMTCP-UNIQUEPID, STATE
> > q
> > [10273] NOTE at dmtcp_coordinator.cpp:590 in handleUserCommand;
> > REASON='killing all connected peers and quitting ...'
> > DMTCP coordinator exiting... (per request)
> >
> > On large data sets this process can take a few days on a 32-core node,
> > so it is something we would really like to be able to checkpoint on
> > our cluster.
> >
> > Please let me know how I can help debug this with you. I've tried
> > versions 1.2.5, 2.4.8, and 2.5, and they all have the same problem.
> >
> > Thanks,
> > Rob Egan
> >
> > ------------------------------------------------------------------------------
> > Check out the vibrant tech community on one of the world's most
> > engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> > _______________________________________________
> > Dmtcp-forum mailing list
> > Dmtcp-forum@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
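For reference, the quick test suggested in the thread can be scripted as a one-liner. Adding `-f`/`-tt` to strace is my own suggestion, not from the thread: it makes strace follow the cloned child, which is where the futex wait appears. A sketch, guarded so it degrades gracefully when DMTCP isn't on the PATH; the basecaller path and flags are taken verbatim from Rob's report:

```shell
# Quick test: relaunch with the alloc and dl plugins disabled, and have
# strace follow forks/clones (-f) with timestamps (-tt) so the cloned
# child's futex wait is captured in dmtcp-stall.strace as well.
if command -v dmtcp_launch >/dev/null 2>&1; then
  dmtcp_launch --disable-alloc-plugin --disable-dl-plugin \
    strace -f -tt -o dmtcp-stall.strace \
    ~/.local/bin/read_fast5_basecaller.py -i uploaded/0/ \
    -o fastq,fast5 -s basecalls6 -t 1 \
    --flowcell FLO-MIN107 --kit SQK-LSK308
else
  echo "dmtcp_launch not on PATH; install DMTCP 2.4.x/2.5.x first"
fi
```

If the program runs to completion with both plugins disabled, that narrows the deadlock down to one of those two plugins.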