Hi Jiajun,

Thank you very much for the help. To answer your questions:

1. I'm running on a single node with 32 cores, using TCP.
2. I've only seen it a few times, and am working on finding a way to
reliably reproduce the problem.
3. I've only experienced it after restart.
4. Once I figure out how to reproduce the problem, I can give you access to
the node. If you could help diagnose the issue, that would be great!

I just tried re-compiling dmtcp with static libstdc++ to see if that will
help.

Best,
John


On Sun, Jul 17, 2016 at 1:29 PM, Jiajun Cao <jia...@ccs.neu.edu> wrote:

> Hi John,
>
> This is interesting. Looks like the application fails when doing a poll().
> There is some info I'd like to collect:
>
> 1. Are you running the application on a single node or on several nodes?
> If it's a distributed application, what's the communication fabric: TCP or
> InfiniBand?
> 2. What are the odds of failure? Can you give an estimate? It could be a race
> condition, based on your description.
> 3. Does it happen only after restart?
> 4. As you mentioned, it's difficult to get enough info from the backtrace.
> If possible, could you give us access to your cluster and suggest how to
> reliably reproduce the bug, so that we can help you diagnose the issue?
>
>
> Rohan,
>
> I remember you committed a change about poll() and select() recently.
> Could that be related?
>
> Best,
> Jiajun
>
> On Sun, Jul 17, 2016 at 9:44 AM, John Moore <johnpmoor...@gmail.com>
> wrote:
>
>> Hi all,
>>
>> I've run into several segfaults when restarting my application with
>> dmtcp_restart. The error is shown below (I apologize, it is not very useful):
>>
>>
>> [workers6730-1:40000] Signal: Segmentation fault (11)
>> [workers6730-1:40000] Signal code:  (128)
>> [workers6730-1:40000] Failing at address: (nil)
>> [workers6730-1:40000] [ 0]
>> /lib/x86_64-linux-gnu/libpthread.so.0(+0xf8d0)[0x7f762e4518d0]
>> [workers6730-1:40000] [ 1]
>> /lib/x86_64-linux-gnu/libc.so.6(__poll+0x2d)[0x7f762e176d3d]
>> [workers6730-1:40000] [ 2]
>> /home/john/local/dmtcp_install/lib/dmtcp/libdmtcp_ipc.so(poll+0x31)[0x7f762ffc8c41]
>> [workers6730-1:40000] [ 3]
>> /home/john/local/openmpi-1.10.2_install/lib/libopen-pal.so.13(+0x6a658)[0x7f762f1e4658]
>> [workers6730-1:40000] [ 4]
>> /home/john/local/openmpi-1.10.2_install/lib/libopen-pal.so.13(opal_libevent2021_event_base_loop+0x1b2)[0x7f762f1dc2b2]
>> [workers6730-1:40000] [ 5] mpirun[0x404d24]
>> [workers6730-1:40000] [ 6] mpirun[0x4035e6]
>> [workers6730-1:40000] [ 7]
>> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f762e0b8b45]
>> [workers6730-1:40000] [ 8] mpirun[0x4034f9]
>>
>>
>> I'm using the master version of dmtcp as well as OpenMPI V. 1.10.2. The
>> command I use to launch dmtcp is:
>>
>> ~/local/dmtcp_install/bin/dmtcp_launch --interval 6400 --no-gzip -q -q
>> mpirun -np 16 ~/local/mpb_install/bin/mpb-mpi interactive?=false
>> k-split-index=1 k-split-num=3 "/home/john/scratch/working/FA211_3.ctl" >>
>> mpb_out
>>
>> And to restart, I simply call ./dmtcp_restart_script.sh
>>
>> I do not launch a separate coordinator; I allow dmtcp_launch to handle
>> the coordinator. Most of the time, the restart is successful, but every so
>> often I get this error.
>>
>> Any idea what might be going on here? Any advice would be greatly
>> appreciated.
>>
>> Thanks in advance!
>> John
>>
>>
>>
>> ------------------------------------------------------------------------------
>> What NetFlow Analyzer can do for you? Monitors network bandwidth and
>> traffic
>> patterns at an interface-level. Reveals which users, apps, and protocols
>> are
>> consuming the most bandwidth. Provides multi-vendor support for NetFlow,
>> J-Flow, sFlow and other flows. Make informed decisions using capacity
>> planning
>> reports.http://sdm.link/zohodev2dev
>> _______________________________________________
>> Dmtcp-forum mailing list
>> Dmtcp-forum@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
>>
>>
>