I've spent some time in the debugger.  As you've probably seen, the action
happens in dlerror.c around line 95.

92        buf = (char *) result->errstring;
93        int n;
94        if (result->errcode == 0)
95          n = __asprintf (&buf, "%s%s%s",
96                          result->objname,
97                          result->objname[0] == '\0' ? "" : ": ",
98                          _(result->errstring));

When this function first executes, the result struct is happy:

(gdb) print *result
$2 = {errcode = 0, returned = 0, malloced = true,
  objname = 0x61884b "/usr/lib64/ImageMagick-6.7.2/modules-Q16/coders/png.so",
  errstring = 0x618820 "undefined symbol: png_LTX_RegisterPNGImage"}

However, it is not that simple.  If I step into that call, I end up in
dcgettext.c, which is apparently trying to operate on result->errstring.

(gdb) step
*__GI___dcgettext (domainname=0x380e755c95 <_libc_intl_domainname> "libc",
    msgid=0x618820 "undefined symbol: png_LTX_RegisterPNGImage",
    category=5) at dcgettext.c:52

Stepping in further puts me into a mess of libc and assembly code I don't
fully understand, so...

(gdb) finish
Run till exit from #0  *__GI___dcgettext (domainname=0x380e755c95 <_libc_intl_domainname> "libc",
    msgid=0x618820 "undefined symbol: png_LTX_RegisterPNGImage",
    category=5) at dcgettext.c:52
__dlerror () at dlerror.c:97
97                result->objname[0] == '\0' ? "" : ": ",
Value returned is $1 = 0x618820 "\360ua"

Well, that's not so good.  Outside of DMTCP, the same call returns
"undefined symbol: png_LTX_RegisterPNGImage".  So it seems dcgettext has
clobbered the error message, but only when running under DMTCP.
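
For anyone who wants to compare the dlerror() text inside and outside
dmtcp_launch without ImageMagick, here is a minimal sketch.  It uses plain
libdl rather than libltdl, so it may well not reproduce the crash (dl worked
under DMTCP in the earlier test); it only shows what a healthy message looks
like.  The function name failed_lookup_message is made up for this example.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stdlib.h>
#include <string.h>

/* Look up `symbol` in the running program.  If the lookup fails, return a
   malloc'd copy of the dlerror() text (caller frees).  Return NULL if the
   symbol was found or no message is available. */
char *failed_lookup_message(const char *symbol)
{
    assert(symbol != NULL);

    void *handle = dlopen(NULL, RTLD_NOW);   /* handle for the main program */
    if (handle == NULL)
        return NULL;

    dlerror();                               /* clear any stale error state */
    void *addr = dlsym(handle, symbol);
    const char *msg = (addr == NULL) ? dlerror() : NULL;

    /* Copy the text before any further dl* call can invalidate it. */
    char *copy = NULL;
    if (msg != NULL) {
        size_t len = strlen(msg) + 1;
        copy = malloc(len);
        if (copy != NULL)
            memcpy(copy, msg, len);
    }

    dlclose(handle);
    return copy;
}
```

Calling failed_lookup_message("png_LTX_RegisterPNGImage") from a small main()
and printing the result, once directly and once under dmtcp_launch, should
make any difference in the message visible.  On older glibc this needs to be
linked with -ldl.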

Worse yet, the result structure itself has been clobbered entirely:

(gdb) print result
$2 = (struct dl_action_result *) 0x7ffff7568880 <wrapper_init_buf>
(gdb) print *result
$3 = {errcode = 0, returned = 1, malloced = false, objname = 0x0,
  errstring = 0x0}

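This is probably bad for things like result->objname[0]: with objname now
NULL, that is a null-pointer dereference, which would fit the segfault at
dlerror.c:95 in the original backtrace.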
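
To make the hazard concrete, here is a rough sketch of the same
"objname: errstring" formatting with NULL checks added.  This is not glibc's
code: struct dl_result_sketch only mirrors the fields gdb printed above, and
format_dl_error is a made-up name.  It uses snprintf and malloc in place of
__asprintf to stay portable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the fields gdb showed in *result.
   This is NOT the real glibc struct dl_action_result. */
struct dl_result_sketch {
    int errcode;
    int returned;
    bool malloced;
    const char *objname;
    const char *errstring;
};

/* Build "objname: errstring", or just errstring when objname is empty,
   without ever dereferencing a NULL field.  Returns a malloc'd string the
   caller frees, or NULL if there is no message or allocation fails. */
char *format_dl_error(const struct dl_result_sketch *res)
{
    assert(res != NULL);

    if (res->errstring == NULL)
        return NULL;                 /* clobbered, or no pending error */

    const char *name = res->objname ? res->objname : "";
    const char *sep = (name[0] == '\0') ? "" : ": ";

    /* First pass measures the length, second pass writes the text. */
    int n = snprintf(NULL, 0, "%s%s%s", name, sep, res->errstring);
    if (n < 0)
        return NULL;

    char *buf = malloc((size_t)n + 1);
    if (buf == NULL)
        return NULL;

    snprintf(buf, (size_t)n + 1, "%s%s%s", name, sep, res->errstring);
    return buf;
}
```

With the "happy" struct from the first gdb print this yields the usual
message; with the clobbered one (objname and errstring both 0x0) it returns
NULL instead of crashing.  That only hides the symptom, of course; the real
question is still who overwrote the struct.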

Thoughts? Red herring?




On Wed, May 25, 2016 at 9:43 AM, Kyle Harrigan <kwharri...@gmail.com> wrote:

> In the meantime, what are the implications of disabling the alloc plugin
> temporarily?  In my case I'm using application-initiated checkpointing
> exclusively, for now.  At least for the single process case, it would seem
> I'm okay since it appears the main function of alloc plugin is to prevent
> alloc and mmap calls during the middle of a checkpoint (which would
> obviously be bad :->).
>
> It would seem unsafe for me to run any multiprocess case since only one of
> my processes is initiating the checkpoint, and the others could get caught
> in a malloc and bad things (tm) would happen.
>
> Am I understanding correctly?
>
>
> On Wed, May 25, 2016 at 9:37 AM, Kyle Harrigan <kwharri...@gmail.com>
> wrote:
>
>> Rohan,
>>
>> Disabling the alloc plugin works for all cases.  This includes my original
>> popen case, the trivial ltdl case, and calling imagemagick "convert"
>> directly from command line using dmtcp_launch.
>>
>> So, you're on the right track.
>>
>>
>>
>>
>> On Wed, May 25, 2016 at 7:29 AM, Rohan Garg <rohg...@ccs.neu.edu> wrote:
>>
>>> I'm not sure yet what the exact issue is, but I don't see the segfault
>>> after disabling the alloc plugin (`dmtcp_launch --disable-alloc-plugin`).
>>> Could you try and see if it helps you too?
>>>
>>> The error has something to do with calls to dlerror() from libltdl
>>> when doing a symbol lookup. The function dlerror, in turn, calls
>>> malloc/realloc, which ends up in the wrappers defined in the alloc
>>> plugin.
>>>
>>> On Wed, May 25, 2016 at 06:43:36AM -0400, Rohan Garg wrote:
>>> > Hi Kyle,
>>> >
>>> > I investigated this a little further. Interestingly, I'm not able
>>> > to reproduce this locally with ImageMagick-6.9.3, and DMTCP-2.5-rc2.
>>> > This is on OpenSUSE-Tumbleweed, with gcc-5.3.1 and glibc-2.23.
>>> > Here's what I did:
>>> >
>>> >   $ dmtcp_launch convert ./test1.jpg ./test2.png
>>> >
>>> > To verify that it's calling in to ltdl, I ran it under gdb and put a
>>> > breakpoint on `lt_dlsym`. Indeed, it does call in to libltdl, but I
>>> > didn't get a segfault.
>>> >
>>> > However, I can confirm that I can reproduce the bug with your simple
>>> > ltdl test program. :-) I'll continue to look at it. Also, if you don't
>>> > mind, I'd like to open a new issue on Github to keep track of this.
>>> >
>>> > Thanks,
>>> > Rohan
>>> >
>>> > On Tue, May 24, 2016 at 12:22:40AM -0400, Kyle Harrigan wrote:
>>> > > Yes, I think I've confirmed it is libltdl.  Trivial test case is here:
>>> > > https://github.com/kwharrigan/dl_vs_ltdl.
>>> > >
>>> > > dl and ltdl test both work fine outside of DMTCP.  Under dmtcp_launch, dl
>>> > > works but ltdl fails w/ basically the same stack trace I got in the other
>>> > > case.  Also note that launching ImageMagick "convert" directly under
>>> > > dmtcp_launch also fails.
>>> > >
>>> > > On Sun, May 22, 2016 at 6:11 PM, Jiajun Cao <jia...@ccs.neu.edu> wrote:
>>> > >
>>> > > > Hi Kyle,
>>> > > >
>>> > > > To confirm it's libltdl that causes the problem, you can write a simple
>>> > > > program that calls lt_dlsym (lt_dlopen if necessary), and returns. Run the
>>> > > > program under DMTCP to see if it crashes.
>>> > > > DMTCP doesn't have special handling for libltdl. It deals with libdl
>>> > > > instead. I wonder if that is the issue. You can replace lt_dlsym with dlsym
>>> > > > to verify the theory. Libdl should work fine under DMTCP.
>>> > > >
>>> > > > Best,
>>> > > > Jiajun
>>> > > >
>>> > > > On Sat, May 21, 2016 at 1:04 PM, Kyle Harrigan <kwharri...@gmail.com>
>>> > > > wrote:
>>> > > >
>>> > > >> Oh, and in case it matters ...  ImageMagick-6.7.2, CentOS 6.7,
>>> > > >> libtool-ltdl-2.2.6-15.5.el6.x86_64, Kernel 2.6.32-573.26.1.el6.x86_64 #1
>>> > > >> SMP Wed May 4 00:57:44 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
>>> > > >>
>>> > > >>
>>> > > >> On Sat, May 21, 2016 at 1:00 PM, Kyle Harrigan <kwharri...@gmail.com>
>>> > > >> wrote:
>>> > > >>
>>> > > >>> Consider the following code
>>> > > >>> {{{
>>> > > >>>     FILE *fh;
>>> > > >>>     FILE *p;
>>> > > >>>     char *buf;
>>> > > >>>     size_t result;
>>> > > >>>     long fsize;
>>> > > >>>
>>> > > >>>     fh = fopen("demo.pgm", "rb");
>>> > > >>>
>>> > > >>>     // allocate buffer to hold file
>>> > > >>>     fseek(fh, 0, SEEK_END);
>>> > > >>>     fsize = ftell(fh);
>>> > > >>>     rewind(fh);
>>> > > >>>     buf = (char*) malloc(sizeof(char)*fsize);
>>> > > >>>
>>> > > >>>     // read from the source file
>>> > > >>>     result = fread(buf, 1, fsize, fh);
>>> > > >>>     printf("result: %zu\n", result);  // size_t needs %zu, not %d
>>> > > >>>     fclose(fh);
>>> > > >>>
>>> > > >>>     // pipe to convert and write to png
>>> > > >>>     p = popen("convert pgm:- out.png", "w");
>>> > > >>>     //p = popen("cat - > out.pgm", "w");
>>> > > >>>     fwrite(buf, 1, fsize, p);
>>> > > >>>     pclose(p);
>>> > > >>>     free(buf);
>>> > > >>> }}}
>>> > > >>>
>>> > > >>> Run as:
>>> > > >>> $ dmtcp_launch ./test_convert
>>> > > >>>
>>> > > >>> It basically loads a file into memory, then pipes it into the convert
>>> > > >>> command via popen and stdin.
>>> > > >>>
>>> > > >>> This segment runs fine outside of DMTCP, but segfaults inside.
>>> > > >>>
>>> > > >>> An examination of the core dump is interesting.
>>> > > >>>
>>> > > >>> $ gdb /usr/bin/convert core.31830
>>> > > >>>
>>> > > >>> (gdb) bt
>>> > > >>> #0  0x00000034ec00f6ab in raise (sig=11)
>>> > > >>>     at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42
>>> > > >>> #1  0x0000003a37914e3f in MagickSignalHandler (signal_number=11)
>>> > > >>>     at magick/magick.c:1168
>>> > > >>> #2  <signal handler called>
>>> > > >>> #3  0x00000034ec40142e in __dlerror () at dlerror.c:95
>>> > > >>> #4  0x00000034fde06685 in ?? () from /usr/lib64/libltdl.so.7
>>> > > >>> #5  0x00000034fde03986 in lt_dlsym () from /usr/lib64/libltdl.so.7
>>> > > >>> #6  0x0000003a37919965 in OpenModule (module=<value optimized out>,
>>> > > >>>     exception=0x224d590) at magick/module.c:1293
>>> > > >>> #7  0x0000003a379151f3 in GetMagickInfo (name=0x7ffcd0f386a0 "PGM",
>>> > > >>>     exception=0x224d590) at magick/magick.c:442
>>> > > >>>
>>> > > >>> So, it is attempting to dynamically load the module to process the PGM
>>> > > >>> file.  Bad things appear to start near lt_dlsym, where __dlerror is
>>> > > >>> called.
>>> > > >>>
>>> > > >>> I'm happy to do some further testing, and the above example should allow
>>> > > >>> you to test on your own as well. If PGM is a weird input for you, I've
>>> > > >>> reproduced the error using PNG inputs, so it does not appear specific to
>>> > > >>> that library; rather, the dynamic loader is pretty consistently failing.
>>> > > >>>
>>> > > >>>
>>> > > >>> --
>>> > > >>> -Kyle
>>> > > >>>
>>> > > >>
>>> > > >>
>>> > > >>
>>> > > >> --
>>> > > >> -Kyle
>>> > > >>
>>> > > >>
>>> > > >>
>>> > > >> ------------------------------------------------------------------------------
>>> > > >> Mobile security can be enabling, not merely restricting. Employees who
>>> > > >> bring their own devices (BYOD) to work are irked by the imposition of MDM
>>> > > >> restrictions. Mobile Device Manager Plus allows you to control only the
>>> > > >> apps on BYO-devices by containerizing them, leaving personal data
>>> > > >> untouched!
>>> > > >> https://ad.doubleclick.net/ddm/clk/304595813;131938128;j
>>> > > >> _______________________________________________
>>> > > >> Dmtcp-forum mailing list
>>> > > >> Dmtcp-forum@lists.sourceforge.net
>>> > > >> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
>>> > > >>
>>> > > >>
>>> > > >
>>> > >
>>> > >
>>> > > --
>>> > > -Kyle
>>> >
>>> >
>>> >
>>> >
>>>
>>
>>
>>
>> --
>> -Kyle
>>
>
>
>
> --
> -Kyle
>



-- 
-Kyle
