I am using DMTCP release 1.2.8. Actually, I am not interested in the distributed functionality; I just want to be able to save and restore the state of a single process. Here is my usage scenario:
In the main() of my program I have a call to mtcp_init():

    mtcp_init ("/tmp/myprog.mtcp", 5, 1);

This is done before any pthread_create() is called. I run the program like this:

    env LD_PRELOAD=<location of my libmtcp.so> myprog ...

Am I doing everything right?

The main thread seems to be making progress, but thread 2 is stuck in epoll_wait() forever. Here is the man page for epoll_wait():

---------------------------------------------------
SYNOPSIS
       #include <sys/epoll.h>

       int epoll_wait(int epfd, struct epoll_event * events,
                      int maxevents, int timeout);

DESCRIPTION
       Wait for events on the epoll file descriptor epfd for a maximum
       time of timeout milliseconds. The memory area pointed to by
       events will contain the events that will be available for the
       caller. Up to maxevents are returned by epoll_wait(2). The
       maxevents parameter must be greater than zero. Specifying a
       timeout of -1 makes epoll_wait(2) wait indefinitely, while
       specifying a timeout equal to zero makes epoll_wait(2) to return
       immediately even if no events are available (return code equal
       to zero).
---------------------------------------------------

In my case, epoll_wait() is called with timeout == -1, i.e. it is supposed to wait forever. This thread in my program handles the communication with another process. It never makes progress once the signal to dump the checkpoint has been received while the thread is in epoll_wait(). Maybe this issue is somehow related to the epoll_wait() specifics?

I'll try to create a standalone program that reproduces this issue.

-Kosta

-----Original Message-----
From: gene [mailto:g...@ccs.neu.edu]
Sent: Saturday, August 17, 2013 9:30 PM
To: Kapil Arya
Cc: Kosta Malolin; dmtcp-forum@lists.sourceforge.net
Subject: Re: [Dmtcp-forum] mtcp deadlocked

Hi Kosta,
Yes, if we can reproduce the bug locally here, that would be helpful. But we'd like to work with you on solving this, regardless. Are you using DMTCP version 1.2.8 (the latest released version)?
Also, would you mind trying our svn trunk? This is the basis for a soon-to-be-released DMTCP version 2.0. To get the svn trunk, do:
  svn co svn://svn.code.sf.net/p/dmtcp/code/trunk dmtcp-trunk
(from http://dmtcp.sourceforge.net/downloads.html)

Thanks,
- Gene

On Fri, Aug 16, 2013 at 11:10:44PM -0400, Kapil Arya wrote:
> Hi Kosta,
>
> From the backtrace of the main thread, it looks like the main thread didn't
> receive the checkpoint signal (SIGUSR2) from the ckpt-thread. I am not
> sure what caused it. Is this something I can reproduce on one of my
> local machines?
>
> Kapil
>
> On Fri, Aug 16, 2013 at 7:55 PM, Kosta Malolin <kosta.malo...@ericsson.com> wrote:
>
> > I am seeing this issue when trying to dump the state of an application.
> >
> > Here is the state of the application when examined with gdb:
> >
> > (gdb) info thread
> >   Id   Target Id         Frame
> >   3    Thread 0x422e1940 (LWP 12723) 0x00002b69d547b23f in mtcp_futex (uaddr=0x2b69d5687fd8, op=0, val=2, timeout=0x2b69d547d280) at mtcp_futex.h:24
> >   2    Thread 0x42ce2940 (LWP 12724) 0x00002b69d547b23f in mtcp_futex (uaddr=0x22371848, op=0, val=5, timeout=0x0) at mtcp_futex.h:24
> > * 1    Thread 0x2b69d8b19f20 (LWP 12722) 0x00002b69d6d86541 in nanosleep () from /lib64/libc.so.6
> > (gdb) bt
> > #0  0x00002b69d6d86541 in nanosleep () from /lib64/libc.so.6
> > #1  0x00002b69d6db9ed4 in usleep () from /lib64/libc.so.6
> > #2  0x000000000040cdbc in AmberSimLoop::simLoop() ()
> > #3  0x000000000040799b in main ()
> > (gdb) thread 2
> > [Switching to thread 2 (Thread 0x42ce2940 (LWP 12724))]
> > #0  0x00002b69d547b23f in mtcp_futex (uaddr=0x22371848, op=0, val=5, timeout=0x0) at mtcp_futex.h:24
> > 24      asm volatile ("syscall"
> > (gdb) bt
> > #0  0x00002b69d547b23f in mtcp_futex (uaddr=0x22371848, op=0, val=5, timeout=0x0) at mtcp_futex.h:24
> > #1  0x00002b69d547b1e4 in mtcp_state_futex (state=0x22371848, func=0, val=5, timeout=0x0) at mtcp_state.c:47
> > #2  0x00002b69d54739a7 in stopthisthread (signum=12) at mtcp.c:3474
> > #3  <signal handler called>
> > #4  0x00002b69d6dc08a8 in epoll_wait () from /lib64/libc.so.6
> > #5  0x00000000007d7bcd in AmberPciePortHandler::handleSlaveRequests() ()
> > #6  0x000000000040ef59 in spawnPcieServer(void*) ()
> > #7  0x00002b69d5cd373d in start_thread () from /lib64/libpthread.so.0
> > #8  0x00002b69d546e957 in threadcloned (threadv=0x22371830) at mtcp.c:1231
> > #9  0x00002b69d6dc04bd in clone () from /lib64/libc.so.6
> > #10 0x0000000000000000 in ?? ()
> > (gdb) thread 3
> > [Switching to thread 3 (Thread 0x422e1940 (LWP 12723))]
> > #0  0x00002b69d547b23f in mtcp_futex (uaddr=0x2b69d5687fd8, op=0, val=2, timeout=0x2b69d547d280) at mtcp_futex.h:24
> > 24      asm volatile ("syscall"
> > (gdb) bt
> > #0  0x00002b69d547b23f in mtcp_futex (uaddr=0x2b69d5687fd8, op=0, val=2, timeout=0x2b69d547d280) at mtcp_futex.h:24
> > #1  0x00002b69d547b1e4 in mtcp_state_futex (state=0x2b69d5687fd8, func=0, val=2, timeout=0x2b69d547d280) at mtcp_state.c:47
> > #2  0x00002b69d546fc90 in checkpointhread (dummy=0x0) at mtcp.c:1998
> > #3  0x00002b69d5cd373d in start_thread () from /lib64/libpthread.so.0
> > #4  0x00002b69d546e957 in threadcloned (threadv=0x1b98fb70) at mtcp.c:1231
> > #5  0x00002b69d6dc04bd in clone () from /lib64/libc.so.6
> > #6  0x0000000000000000 in ?? ()
> > (gdb)
> >
> > Apparently, an attempt to dump a checkpoint was made when thread 1 was in
> > nanosleep() and thread 2 was in epoll_wait().
> >
> > This resulted in a deadlock.
> > Any ideas on what is going on?
> >
> > -Kosta

_______________________________________________
Dmtcp-forum mailing list
Dmtcp-forum@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
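
On the question raised in the thread of whether epoll_wait() specifics could matter: one documented Linux behaviour (see signal(7)) is that epoll_wait() is never restarted automatically after a signal handler returns, even when the handler is installed with SA_RESTART. It fails with EINTR and the caller has to retry. The sketch below is not the poster's program and does not use MTCP at all. It is a hypothetical standalone test (the names on_sigusr2 and count_eintr_wakeups are invented for illustration) that blocks in epoll_wait() with timeout == -1 on a pipe while a child process sends SIGUSR2 and then writes one byte. It does not by itself explain the hang in the backtrace above, where thread 2 is still inside the MTCP signal handler; it only shows what the application loop should expect once the handler returns.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for a checkpoint signal handler: it just returns. */
static void on_sigusr2(int signum) { (void)signum; }

/*
 * Block in epoll_wait(timeout = -1) on a pipe while a child process first
 * sends SIGUSR2 to us and then writes one byte to the pipe.
 * Returns the number of times epoll_wait() failed with EINTR before the
 * byte arrived, or -1 on any other error.
 */
int count_eintr_wakeups(void)
{
    struct sigaction sa;
    struct epoll_event ev, out;
    int pipefd[2] = { -1, -1 };
    int epfd = -1, eintr_count = 0, result = -1;
    pid_t child;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigusr2;
    sa.sa_flags = SA_RESTART;   /* epoll_wait() is not restarted even so */
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR2, &sa, NULL) == -1)
        return -1;

    if (pipe(pipefd) == -1)
        return -1;
    epfd = epoll_create(1);
    if (epfd == -1)
        goto done;

    memset(&ev, 0, sizeof ev);
    ev.events = EPOLLIN;
    ev.data.fd = pipefd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) == -1)
        goto done;

    child = fork();
    if (child == -1)
        goto done;
    if (child == 0) {
        /* Child: give the parent time to block, signal it, then send data. */
        usleep(200000);
        kill(getppid(), SIGUSR2);
        usleep(200000);
        if (write(pipefd[1], "x", 1) != 1)
            _exit(1);
        _exit(0);
    }

    for (;;) {
        int n = epoll_wait(epfd, &out, 1, -1);
        if (n == 1) {                      /* data arrived: normal wakeup */
            result = eintr_count;
            break;
        }
        if (n == -1 && errno == EINTR) {   /* handler ran: retry the wait */
            eintr_count++;
            continue;
        }
        break;                             /* unexpected error */
    }
    waitpid(child, NULL, 0);

done:
    if (epfd != -1)
        close(epfd);
    close(pipefd[0]);
    close(pipefd[1]);
    return result;
}
```

Calling count_eintr_wakeups() from a small main() and printing the result is enough to try it. A loop that treats the EINTR return as a fatal error, or that does not retry, would stop handling requests after the first signal, so this is one thing worth checking in the real handleSlaveRequests() loop before building a full reproducer.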