Hi John,

This is interesting. It looks like the application fails inside a call to
poll(). There is some information I'd like to collect:

1. Are you running the application on a single node or on several nodes? If
it's a distributed application, what is the communication fabric: TCP or
InfiniBand?
2. How often does the failure occur? Can you give an estimate? Based on your
description, it could be a race condition.
3. Does it happen only after restart?
4. As you mentioned, it's difficult to get enough information from the
backtrace. If possible, could you give us access to your cluster and
describe how to reliably reproduce the bug, so that we can help you
diagnose the issue?
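
For question 2, one rough way to get an estimate is to repeat the restart a
number of times and count how many attempts exit with a non-zero status. The
sketch below is only an illustration: the function name
estimate_failure_rate is made up here, and it assumes that each restart
attempt can be left to run until it exits or crashes, and that a crash shows
up as a non-zero exit status.

```python
import subprocess


def estimate_failure_rate(cmd, runs):
    """Run `cmd` `runs` times and return (failures, runs).

    A run counts as a failure when its exit status is non-zero.
    On POSIX, a process killed by a signal (such as SIGSEGV) is
    reported by subprocess with a negative return code, so it is
    counted as a failure too.
    """
    failures = 0
    for _ in range(runs):
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            failures += 1
    return failures, runs


# Hypothetical usage, run from the directory that holds the restart script:
#   failures, total = estimate_failure_rate(["./dmtcp_restart_script.sh"], 20)
#   print(f"{failures} of {total} restarts failed")
```

Even a small number of runs would tell us whether the failure is closer to
one in two or one in a hundred, which helps in judging whether a race
condition is plausible.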


Rohan,

I remember you committed a change about poll() and select() recently. Could
that be related?

Best,
Jiajun

On Sun, Jul 17, 2016 at 9:44 AM, John Moore <johnpmoor...@gmail.com> wrote:

> Hi all,
>
> I've run into several segfaults when restarting my application with
> dmtcp_restart. The error is shown below (I apologize, not very useful):
>
>
> [workers6730-1:40000] Signal: Segmentation fault (11)
> [workers6730-1:40000] Signal code:  (128)
> [workers6730-1:40000] Failing at address: (nil)
> [workers6730-1:40000] [ 0]
> /lib/x86_64-linux-gnu/libpthread.so.0(+0xf8d0)[0x7f762e4518d0]
> [workers6730-1:40000] [ 1]
> /lib/x86_64-linux-gnu/libc.so.6(__poll+0x2d)[0x7f762e176d3d]
> [workers6730-1:40000] [ 2]
> /home/john/local/dmtcp_install/lib/dmtcp/libdmtcp_ipc.so(poll+0x31)[0x7f762ffc8c41]
> [workers6730-1:40000] [ 3]
> /home/john/local/openmpi-1.10.2_install/lib/libopen-pal.so.13(+0x6a658)[0x7f762f1e4658]
> [workers6730-1:40000] [ 4]
> /home/john/local/openmpi-1.10.2_install/lib/libopen-pal.so.13(opal_libevent2021_event_base_loop+0x1b2)[0x7f762f1dc2b2]
> [workers6730-1:40000] [ 5] mpirun[0x404d24]
> [workers6730-1:40000] [ 6] mpirun[0x4035e6]
> [workers6730-1:40000] [ 7]
> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f762e0b8b45]
> [workers6730-1:40000] [ 8] mpirun[0x4034f9]
>
>
> I'm using the master version of dmtcp as well as OpenMPI V. 1.10.2. The
> command I use to launch dmtcp is:
>
> ~/local/dmtcp_install/bin/dmtcp_launch --interval 6400 --no-gzip -q -q
> mpirun -np 16 ~/local/mpb_install/bin/mpb-mpi interactive?=false
> k-split-index=1 k-split-num=3 "/home/john/scratch/working/FA211_3.ctl" >>
> mpb_out
>
> And to restart, I simply call ./dmtcp_restart_script.sh
>
> I do not launch a separate coordinator, I allow dmtcp_launch to handle the
> coordinator. Most of the time, the restart is successful, but every so
> often I get this error.
>
> Any idea what might be going on here? Any advice would be greatly
> appreciated.
>
> Thanks in advance!
> John
>
>
>
> ------------------------------------------------------------------------------
> What NetFlow Analyzer can do for you? Monitors network bandwidth and
> traffic
> patterns at an interface-level. Reveals which users, apps, and protocols
> are
> consuming the most bandwidth. Provides multi-vendor support for NetFlow,
> J-Flow, sFlow and other flows. Make informed decisions using capacity
> planning
> reports.http://sdm.link/zohodev2dev
> _______________________________________________
> Dmtcp-forum mailing list
> Dmtcp-forum@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
>
>