Hi Joshua,

Thanks for the detailed bug report and the sample program :-).

Before we go further, I want to let you know that the svn repository
location has changed (due to the SourceForge project update). Here are the
new URLs:

Trunk: svn checkout svn://svn.code.sf.net/p/dmtcp/code/trunk dmtcp-trunk
1.2: svn checkout svn://svn.code.sf.net/p/dmtcp/code/branches/1.2 dmtcp-1.2


I have now fixed this particular problem in the svn. However, there is
still a weird case that can prevent checkpoint from happening and I am not
sure about the correct approach at the moment.

dlopen() holds a lock while loading shared libraries. And if we try to
checkpoint while in the middle of dlopen(), it causes a deadlock because
the ckpt-thread tries to call dlsym() which in turn requires this lock :-(.

To prevent this deadlock, we disable checkpointing while in the middle of
dlopen(). This works well in normal cases. However, when a static
initializer creates a new process via fork/exec, the computation
deadlocks. Here is how: the parent process has received the checkpoint
message from the coordinator, but waits for dlopen() to return before
acting on it. Meanwhile, the child process has also received the
checkpoint message from the coordinator and suspends itself. Thus the
parent is waiting for a child that is suspended :-(.
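
The condition above can be written down as a small state model. This is
only a sketch for reasoning about the scenario; struct proc and both
functions are hypothetical and do not correspond to anything in the DMTCP
sources.

```c
#include <stdbool.h>

/* Hypothetical per-process state; illustrative only, not a DMTCP type. */
struct proc {
    bool in_dlopen;        /* checkpointing is disabled while true */
    bool ckpt_requested;   /* the coordinator's message has arrived */
    bool suspended;        /* the process has stopped for checkpoint */
    bool waiting_on_child; /* blocked until the child exits */
};

/* Deliver the checkpoint message. A process suspends right away only
 * if it is outside dlopen(); otherwise the request stays pending. */
void deliver_ckpt(struct proc *p)
{
    p->ckpt_requested = true;
    if (!p->in_dlopen)
        p->suspended = true;
}

/* True when the parent cannot leave dlopen() because it is waiting on
 * a child that has already suspended itself for the checkpoint. */
bool is_deadlocked(const struct proc *parent, const struct proc *child)
{
    return parent->in_dlopen && parent->ckpt_requested &&
           parent->waiting_on_child && child->suspended;
}
```

With a parent that is inside dlopen() and waiting on its child, delivering
the message to both processes makes is_deadlocked() return true. If the
parent is not inside dlopen(), it simply suspends and the condition never
holds.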

If the child processes are short-lived, this should not be a problem, but
in the rare case where the checkpoint command is given during the
initialization of a shared library, deadlocks can occur.

As I said earlier, I am not sure what the correct approach would be at this
point. Do you have any suggestions?

Thanks,
Kapil


On Mon, Mar 4, 2013 at 3:32 PM, Gene Cooperman <[email protected]> wrote:

> Hi Joshua,
>     Thanks for uncovering this bug, and for the precise documentation of
> it. This is really helpful. We expect to be able to fix this quickly. It
> is typically caused in exactly the way that you document: user code holds
> a lock defined by one of the run-time libraries, and DMTCP calls a
> function that uses the same lock at checkpoint time. DMTCP tries to be
> conservative at checkpoint time and avoid situations in which it might
> grab a lock already held by user code. You seem to have uncovered an
> analogous situation occurring during static initialization. Thanks for
> reporting this.
>
> - Gene
>
> On Mon, Mar 04, 2013 at 07:03:31PM +0000, Louie, Joshua D wrote:
> > Hi,
> >     I've run into an issue where static initialization or constructor
> > attribute functions that are called in a loaded shared object cause a
> > hang. All versions I've tried (1.2.4 - eventual release of 2.0.0) hit
> > the same issue. Here's the scenario (I've attached sample code to
> > reproduce the issue). The main code opens a shared object with dlopen,
> > and one of the static initialization functions does a fork or system
> > command. fork/execvp/execvpe/execve all have to grab the lock with
> > write permissions. The problem is that before we call the actual
> > dlopen, we have to grab the lock with read permissions. Normally we
> > release the lock after the dlopen work finishes. Here, though, we're
> > not done with dlopen, so the call trying to get the write lock can't
> > get it since a reader is still holding it. In my particular situation,
> > I can make progress by not grabbing the lock on the dlopen, since I
> > have well-defined times as to when a checkpoint will occur, but I
> > wanted to bring this to your attention so you all can figure out the
> > best way to deal with it.
> >
> > ---------- dlopen_test.c ----------
> > #include<cstdio>
> > #include<dlfcn.h>
> > #include<dmtcpaware.h>
> >
> > typedef int (*print_fn)(void);
> >
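> > int main() {
> >     printf("Opening ./printer.so\n");
> >     void *so = dlopen("./printer.so", RTLD_LOCAL | RTLD_LAZY);
> >     if (so == NULL) {
> >         fprintf(stderr, "dlopen failed: %s\n", dlerror());
> >         return 1;
> >     }
> >     printf("Done opening ./printer.so\n");
> >
> >     // Look up print_func in the loaded library and call it.
> >     print_fn print_func = (print_fn)dlsym(so, "print_func");
> >     if (print_func == NULL) {
> >         fprintf(stderr, "dlsym failed: %s\n", dlerror());
> >         return 1;
> >     }
> >     print_func();
> >     return 0;
> > }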
> >
> > ---------- printer.c ----------
> > #include<cstdio>
> > #include<cstdlib>
> >
> > extern "C" int
> > print_constructor() {
> >     printf("    In print_constructor\n");
> >     system("echo '    Will I hang?'"); // This is where it hangs in a DMTCP environment
> >     printf("    Leaving print_constructor\n");
> >     return 0;
> > }
> >
> > extern "C" int
> > print_func() {
> >     printf("    In print_func\n");
> >     return 0;
> > }
> >
> > static int value = print_constructor();
> >
> > ---------- Expected results ----------
> > Opening ./printer.so
> >     In print_constructor
> >     Will I hang?
> >     Leaving print_constructor
> > Done opening ./printer.so
> >     In print_func
> >
> > Joshua Louie
> >
>
> >
>
> > _______________________________________________
> > Dmtcp-forum mailing list
> > [email protected]
> > https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
>
>