Hi Kosta,
    Yes, if we can reproduce the bug locally here, that would be helpful.
But we'd like to work with you on solving this, regardless.
    Are you using DMTCP version 1.2.8 (the latest released version)?
    Also, would you mind trying our svn trunk?  This is the basis for
a soon-to-be-released DMTCP version 2.0.  To get the svn trunk, do:
   svn co svn://svn.code.sf.net/p/dmtcp/code/trunk dmtcp-trunk 
   (from http://dmtcp.sourceforge.net/downloads.html)
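
In the meantime, it might be worth checking whether the application blocks or ignores SIGUSR2 in the main thread (LWP 12722 in the gdb session quoted below), since the backtrace suggests the signal never arrived there. Here is a minimal sketch, not part of DMTCP, that reads the per-thread signal masks from /proc on Linux. The helper names (signal_in_mask, parse_status, report) are made up for this illustration, and it assumes the checkpoint signal is number 12, as signum=12 in the stopthisthread frame suggests.

```python
from pathlib import Path

# Assumption: the checkpoint signal is 12 (SIGUSR2 on x86-64 Linux),
# matching signum=12 in the stopthisthread frame of the backtrace.
CKPT_SIGNUM = 12


def signal_in_mask(mask_hex, signum):
    """Return True if signal `signum` is set in a /proc hex signal mask.

    In /proc/<pid>/status, bit (signum - 1) of the mask stands for `signum`.
    """
    return bool((int(mask_hex, 16) >> (signum - 1)) & 1)


def parse_status(text):
    """Extract the SigBlk, SigIgn and SigCgt fields from a status file."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key in ("SigBlk", "SigIgn", "SigCgt"):
            fields[key] = value.strip()
    return fields


def report(pid, signum=CKPT_SIGNUM):
    """Print, for every thread of `pid`, whether `signum` is blocked,
    ignored, or caught (i.e. has a handler installed)."""
    for task in sorted(Path(f"/proc/{pid}/task").iterdir()):
        masks = parse_status((task / "status").read_text())
        flags = {name: signal_in_mask(mask, signum)
                 for name, mask in masks.items()}
        print(task.name, flags)
```

Calling report(12722) while the process is hung would print one line per thread. A set SigBlk or SigIgn bit for the main thread would be consistent with the signal not being delivered, though it would not by itself prove that this is the cause.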
Thanks,
- Gene

On Fri, Aug 16, 2013 at 11:10:44PM -0400, Kapil Arya wrote:
> Hi Kosta,
> 
> From the backtrace of the main thread, it looks like the main thread didn't
> receive the checkpoint signal (SIGUSR2) from the ckpt-thread. I am not sure
> what caused it. Is this something I can reproduce on one of my local
> machines?
> 
> Kapil
> 
> 
> On Fri, Aug 16, 2013 at 7:55 PM, Kosta Malolin
> <kosta.malo...@ericsson.com> wrote:
> 
> > I am seeing this issue when trying to dump the state of an application.
> >
> > Here is the state of the application when examined with gdb:
> >
> > (gdb) info thread
> >   Id   Target Id         Frame
> >   3    Thread 0x422e1940 (LWP 12723) 0x00002b69d547b23f in mtcp_futex
> > (uaddr=0x2b69d5687fd8, op=0, val=2, timeout=0x2b69d547d280) at
> > mtcp_futex.h:24
> >   2    Thread 0x42ce2940 (LWP 12724) 0x00002b69d547b23f in mtcp_futex
> > (uaddr=0x22371848, op=0, val=5, timeout=0x0) at mtcp_futex.h:24
> > * 1    Thread 0x2b69d8b19f20 (LWP 12722) 0x00002b69d6d86541 in nanosleep
> > () from /lib64/libc.so.6
> >
> > (gdb) bt
> > #0  0x00002b69d6d86541 in nanosleep () from /lib64/libc.so.6
> > #1  0x00002b69d6db9ed4 in usleep () from /lib64/libc.so.6
> > #2  0x000000000040cdbc in AmberSimLoop::simLoop() ()
> > #3  0x000000000040799b in main ()
> >
> > (gdb) thread 2
> > [Switching to thread 2 (Thread 0x42ce2940 (LWP 12724))]
> > #0  0x00002b69d547b23f in mtcp_futex (uaddr=0x22371848, op=0, val=5,
> > timeout=0x0) at mtcp_futex.h:24
> > 24        asm volatile ("syscall"
> > (gdb) bt
> > #0  0x00002b69d547b23f in mtcp_futex (uaddr=0x22371848, op=0, val=5,
> > timeout=0x0) at mtcp_futex.h:24
> > #1  0x00002b69d547b1e4 in mtcp_state_futex (state=0x22371848, func=0,
> > val=5, timeout=0x0) at mtcp_state.c:47
> > #2  0x00002b69d54739a7 in stopthisthread (signum=12) at mtcp.c:3474
> > #3  <signal handler called>
> > #4  0x00002b69d6dc08a8 in epoll_wait () from /lib64/libc.so.6
> > #5  0x00000000007d7bcd in AmberPciePortHandler::handleSlaveRequests() ()
> > #6  0x000000000040ef59 in spawnPcieServer(void*) ()
> > #7  0x00002b69d5cd373d in start_thread () from /lib64/libpthread.so.0
> > #8  0x00002b69d546e957 in threadcloned (threadv=0x22371830) at mtcp.c:1231
> > #9  0x00002b69d6dc04bd in clone () from /lib64/libc.so.6
> > #10 0x0000000000000000 in ?? ()
> >
> > (gdb) thread 3
> > [Switching to thread 3 (Thread 0x422e1940 (LWP 12723))]
> > #0  0x00002b69d547b23f in mtcp_futex (uaddr=0x2b69d5687fd8, op=0, val=2,
> > timeout=0x2b69d547d280) at mtcp_futex.h:24
> > 24        asm volatile ("syscall"
> > (gdb) bt
> > #0  0x00002b69d547b23f in mtcp_futex (uaddr=0x2b69d5687fd8, op=0, val=2,
> > timeout=0x2b69d547d280) at mtcp_futex.h:24
> > #1  0x00002b69d547b1e4 in mtcp_state_futex (state=0x2b69d5687fd8, func=0,
> > val=2, timeout=0x2b69d547d280) at mtcp_state.c:47
> > #2  0x00002b69d546fc90 in checkpointhread (dummy=0x0) at mtcp.c:1998
> > #3  0x00002b69d5cd373d in start_thread () from /lib64/libpthread.so.0
> > #4  0x00002b69d546e957 in threadcloned (threadv=0x1b98fb70) at mtcp.c:1231
> > #5  0x00002b69d6dc04bd in clone () from /lib64/libc.so.6
> > #6  0x0000000000000000 in ?? ()
> > (gdb)
> >
> > Apparently, an attempt to take a checkpoint was made while thread 1 was
> > in nanosleep() and thread 2 was in epoll_wait().
> >
> > This resulted in a deadlock. Any ideas on what is going on?
> >
> > -Kosta
> >
> >
> > ------------------------------------------------------------------------------
> > Get 100% visibility into Java/.NET code with AppDynamics Lite!
> > It's a free troubleshooting tool designed for production.
> > Get down to code-level detail for bottlenecks, with <2% overhead.
> > Download for free and get started troubleshooting in minutes.
> > http://pubads.g.doubleclick.net/gampad/clk?id=48897031&iu=/4140/ostg.clktrk
> > _______________________________________________
> > Dmtcp-forum mailing list
> > Dmtcp-forum@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
> >
> >
