I put up a PR for this: https://github.com/apache/incubator-heron/pull/3492
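
For context on the "pure virtual method called" abort in the quoted logs below: while a base-class destructor is running, virtual dispatch no longer reaches the derived class. The sketch below only illustrates that C++ rule. `Base`, `Derived` and `DispatchLog` are hypothetical names, not Heron classes, and `OnEvent` is deliberately non-pure so the program stays well defined. If `OnEvent` were pure virtual in `Base`, the call made during destruction would typically end up in `__cxa_pure_virtual`, as in the stack traces.

```cpp
#include <string>
#include <vector>

// Hypothetical base class; stands in for any class with virtual callbacks.
struct Base {
  explicit Base(std::vector<std::string>* log) : log_(log) {}
  // Virtual call made while the object is being destroyed.
  virtual ~Base() { Notify(); }
  // Non-pure on purpose: a pure virtual here would abort at runtime instead.
  virtual void OnEvent() { log_->push_back("Base::OnEvent"); }
  void Notify() { OnEvent(); }
  std::vector<std::string>* log_;
};

// Hypothetical derived class that overrides the callback.
struct Derived : Base {
  using Base::Base;
  void OnEvent() override { log_->push_back("Derived::OnEvent"); }
};

// Records which implementation runs before and during destruction.
std::vector<std::string> DispatchLog() {
  std::vector<std::string> log;
  {
    Derived d(&log);
    d.Notify();  // object fully alive: dispatches to Derived::OnEvent
  }              // ~Base runs after ~Derived: dispatches to Base::OnEvent
  return log;
}
```

Calling `DispatchLog()` returns `{"Derived::OnEvent", "Base::OnEvent"}`: the second entry comes from the destructor, where the derived override is no longer used.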

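One general way to keep a pending timer from calling into a client that has already been destroyed is to have the callback hold a `std::weak_ptr` and skip the call when the object is gone. This is only a sketch of that pattern under stated assumptions: `FakeEventLoop` and `ReconnectingClient` are hypothetical stand-ins, not Heron's `EventLoopImpl` or `StMgrClient`, and it does not describe what the PR above changes.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical event loop: stores timer callbacks and runs them on demand.
class FakeEventLoop {
 public:
  void AddTimer(std::function<void()> cb) {
    assert(cb && "timer callback must be callable");
    pending_.push_back(std::move(cb));
  }
  // Runs every pending callback once and returns how many were invoked.
  int RunPending() {
    std::vector<std::function<void()>> ready;
    ready.swap(pending_);
    for (auto& cb : ready) cb();
    return static_cast<int>(ready.size());
  }

 private:
  std::vector<std::function<void()>> pending_;
};

// Hypothetical client that schedules a reconnect timer on the loop.
class ReconnectingClient
    : public std::enable_shared_from_this<ReconnectingClient> {
 public:
  ReconnectingClient(FakeEventLoop* loop, int* reconnects)
      : loop_(loop), reconnects_(reconnects) {}

  void ScheduleReconnect() {
    // Capture a weak_ptr so the timer does not extend the client's lifetime.
    std::weak_ptr<ReconnectingClient> weak = shared_from_this();
    loop_->AddTimer([weak]() {
      // Only call into the client if it still exists.
      if (auto self = weak.lock()) self->OnReconnectTimer();
    });
  }

 private:
  void OnReconnectTimer() { ++*reconnects_; }
  FakeEventLoop* loop_;
  int* reconnects_;
};

// Timer fires after the client was destroyed: the callback is a no-op.
int ReconnectsAfterClientDestroyed() {
  FakeEventLoop loop;
  int reconnects = 0;
  auto client = std::make_shared<ReconnectingClient>(&loop, &reconnects);
  client->ScheduleReconnect();
  client.reset();
  loop.RunPending();
  return reconnects;
}

// Timer fires while the client is alive: the reconnect handler runs once.
int ReconnectsWhileClientAlive() {
  FakeEventLoop loop;
  int reconnects = 0;
  auto client = std::make_shared<ReconnectingClient>(&loop, &reconnects);
  client->ScheduleReconnect();
  loop.RunPending();
  return reconnects;
}
```

The trade-off is that the client has to be owned by a `std::shared_ptr`. An alternative with raw ownership would be to cancel the timer in the destructor, which needs a cancel call on the event loop.
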
On Tue, Mar 17, 2020 at 5:53 PM Huijun Wu <[email protected]> wrote:

> Found a way to reproduce this issue:
>
> 1. Run the ExclamationTopology job:
> ~/.heron/bin/heron submit local \
> ~/.heron/examples/heron-api-examples.jar \
> org.apache.heron.examples.api.ExclamationTopology \
> hello-world-topology
>
> 2. Kill one of the heron-executor processes, like this:
> -----
> [tw-mbp-huijunw .herondata]$ PID=`ps -ef | grep SchedulerMain | grep java | awk '{print $2}'`
> [tw-mbp-huijunw .herondata]$ [ -z "$PID" ] && echo "topology not found" || pstree -p $PID
> -+= 00001 root /sbin/launchd
>  \-+- 86666 huijunw
> /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/bin/java -cp /U
>    |-+= 86672 huijunw
> /System/Library/Frameworks/Python.framework/Versions/2.7/Resources/Pytho
>    | |--- 86696 huijunw ./heron-core/bin/heron-tmaster
> --topology_name=hello-world-topology --
>    | |--- 86698 huijunw
> /System/Library/Frameworks/Python.framework/Versions/2.7/Resources/Pyt
>    | |--- 86701 huijunw
> /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/bin/java -X
>    | |--- 86703 huijunw
> /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/bin/java -X
>    | \--- 86707 huijunw
> /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/bin/java -X
> *   |-+= 86673 huijunw
> /System/Library/Frameworks/Python.framework/Versions/2.7/Resources/Pytho
>    | |--- 86700 huijunw
> /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/bin/java -X
>    | |--- 86704 huijunw ./heron-core/bin/heron-stmgr
> --topology_name=hello-world-topology --to
>    | |--- 86705 huijunw
> /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/bin/java -X
>    | |--- 86708 huijunw
> /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/bin/java -X
>    | |--- 86710 huijunw
> /System/Library/Frameworks/Python.framework/Versions/2.7/Resources/Pyt
>    | \--- 86712 huijunw
> /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/bin/java -X*
>    \-+= 86674 huijunw
> /System/Library/Frameworks/Python.framework/Versions/2.7/Resources/Pytho
>      |--- 86697 huijunw ./heron-core/bin/heron-stmgr
> --topology_name=hello-world-topology --to
>      |--- 86699 huijunw
> /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/bin/java -X
>      |--- 86702 huijunw
> /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/bin/java -X
>      |--- 86706 huijunw
> /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/bin/java -X
>      |--- 86709 huijunw
> /System/Library/Frameworks/Python.framework/Versions/2.7/Resources/Pyt
>      \--- 86711 huijunw
> /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/bin/java -X
> [tw-mbp-huijunw .herondata]$ kill 86673
> [tw-mbp-huijunw .herondata]$ [ -z "$PID" ] && echo "topology not found" || pstree -p $PID
> -+= 00001 root /sbin/launchd
>  \-+- 86666 huijunw
> /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/bin/java -cp /U
>    |-+= 86672 huijunw
> /System/Library/Frameworks/Python.framework/Versions/2.7/Resources/Pytho
>    | |--- 86696 huijunw ./heron-core/bin/heron-tmaster
> --topology_name=hello-world-topology --
>    | |--- 86698 huijunw
> /System/Library/Frameworks/Python.framework/Versions/2.7/Resources/Pyt
>    | |--- 86701 huijunw
> /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/bin/java -X
>    | |--- 86703 huijunw
> /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/bin/java -X
>    | \--- 86707 huijunw
> /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/bin/java -X
>    |-+= 86674 huijunw
> /System/Library/Frameworks/Python.framework/Versions/2.7/Resources/Pytho
>    | |--- 86697 huijunw (heron-stmgr)
>    | |--- 86699 huijunw
> /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/bin/java -X
>    | |--- 86702 huijunw
> /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/bin/java -X
>    | |--- 86706 huijunw
> /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/bin/java -X
>    | |--- 86709 huijunw
> /System/Library/Frameworks/Python.framework/Versions/2.7/Resources/Pyt
>    | \--- 86711 huijunw
> /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/bin/java -X
> *   \-+= 86745 huijunw
> /System/Library/Frameworks/Python.framework/Versions/2.7/Resources/Pytho
>      |--- 86753 huijunw
> /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/bin/java -X
>      |--- 86754 huijunw ./heron-core/bin/heron-stmgr
> --topology_name=hello-world-topology --to
>      |--- 86755 huijunw
> /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/bin/java -X
>      |--- 86756 huijunw
> /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/bin/java -X
>      |--- 86757 huijunw
> /System/Library/Frameworks/Python.framework/Versions/2.7/Resources/Pyt
>      \--- 86758 huijunw
> /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/bin/java -X*
> -----
>
> 3. The exception appeared in the 86674 heron-executor log.
>
>
>
>
> On Sat, Mar 7, 2020 at 11:55 PM Huijun Wu <[email protected]>
> wrote:
>
>> I cannot reproduce it on my local laptop. More investigation is needed
>> into how to reproduce it.
>>
>> On Sat, Mar 7, 2020 at 3:06 PM Dmitry Rusakov <[email protected]>
>> wrote:
>>
>>> Could you please try to run a job locally and manually kill one
>>> container? Does it reproduce the problem on your laptop?
>>>
>>>
>>> On Sat, Mar 7, 2020 at 2:12 PM Huijun Wu <[email protected]>
>>> wrote:
>>>
>>>> Observed that restarting a single container leads to all stmgrs
>>>> failing and restarting. Thus killing a single stmgr should reproduce this
>>>> issue.
>>>>
>>>> On Tue, Mar 3, 2020 at 10:58 AM Ning Wang <[email protected]> wrote:
>>>>
>>>>> Ok. The log shows that the connection is being closed. So it just looks
>>>>> like a network issue causing the connection to be unstable.
>>>>>
>>>>> Looks like a race condition to me, if this theory is correct. You may
>>>>> need to review the object and add a mutex around the critical sections.
>>>>>
>>>>> On Tue, Mar 3, 2020 at 10:07 AM Xiaoyao Qian <[email protected]>
>>>>> wrote:
>>>>>
>>>>> > During the normal running state. And it happens in only one data center.
>>>>> >
>>>>> > On Tue, Mar 3, 2020 at 9:33 AM Ning Wang <[email protected]>
>>>>> wrote:
>>>>> >
>>>>> >> Yeah. A segfault is more likely in this case, though maybe not because
>>>>> >> of the object itself but because of the vtable pointer.
>>>>> >>
>>>>> >> A pure virtual function call might still be possible, I think. It all
>>>>> >> depends on the 4 bytes at that address. It could happen if a new object
>>>>> >> of a different type has been created in that place, or an object has
>>>>> >> been moved into that memory block.
>>>>> >>
>>>>> >> More investigation is necessary.
>>>>> >>
>>>>> >> When does it happen? In the normal running state, or when the topology
>>>>> >> or an instance is shutting down?
>>>>> >>
>>>>> >>
>>>>> >> On Tue, Mar 3, 2020 at 8:34 AM Xiaoyao Qian <[email protected]>
>>>>> wrote:
>>>>> >>
>>>>> >>> I agree with the second possibility, but wouldn’t it cause a segfault
>>>>> >>> if the object has been deleted but the method is still invoked?
>>>>> >>>
>>>>> >>> On Tue, Mar 3, 2020 at 01:16 Ning Wang <[email protected]>
>>>>> wrote:
>>>>> >>>
>>>>> >>>> Looks like the exception is caused by a pure virtual function call.
>>>>> >>>>
>>>>> >>>> Both exceptions are from BaseClient. It seems like the Client object
>>>>> >>>> doesn't have the virtual functions implemented, which is not expected.
>>>>> >>>>
>>>>> >>>> Another possibility is that the client object has been deleted, hence
>>>>> >>>> the vtable is not valid any more. This could be something you can check,
>>>>> >>>> given that the last log shows "Stmgr stmgr-12 closed connection".
>>>>> >>>>
>>>>> >>>>
>>>>> >>>> On Mon, Mar 2, 2020 at 12:20 PM Huijun Wu <
>>>>> [email protected]>
>>>>> >>>> wrote:
>>>>> >>>>
>>>>> >>>>> Hi,
>>>>> >>>>>
>>>>> >>>>> We observed an Stmgr core dump; see the logs below. Does anybody
>>>>> >>>>> have any idea about this? Thanks.
>>>>> >>>>>
>>>>> >>>>> Best,
>>>>> >>>>> Huijun
>>>>> >>>>>
>>>>> >>>>> ------------
>>>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout: pure
>>>>> virtual
>>>>> >>>>> method
>>>>> >>>>> called
>>>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout: terminate
>>>>> called
>>>>> >>>>> without an active exception
>>>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout: ***
>>>>> Aborted at
>>>>> >>>>> 1582847383 (unix time) try "date -d @1582847383" if you are
>>>>> using GNU
>>>>> >>>>> date
>>>>> >>>>> ***
>>>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout: PC: @
>>>>> >>>>> 0x7f5bc42da277 __GI_raise
>>>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout: *** SIGABRT
>>>>> >>>>> (@0xbcb0000b338) received by PID 45880 (TID 0x7f5bc54c1780) from
>>>>> PID
>>>>> >>>>> 45880;
>>>>> >>>>> stack trace: ***
>>>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout:     @
>>>>> >>>>> 0x7f5bc50a76d0 (unknown)
>>>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout:     @
>>>>> >>>>> 0x7f5bc42da277 __GI_raise
>>>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout:     @
>>>>> >>>>> 0x7f5bc42db968 __GI_abort
>>>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout:     @
>>>>> >>>>> 0x7f5bc48e77d5 __gnu_cxx::__verbose_terminate_handler()
>>>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout:     @
>>>>> >>>>> 0x7f5bc48e5746 (unknown)
>>>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout:     @
>>>>> >>>>> 0x7f5bc48e5773 std::terminate()
>>>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout:     @
>>>>> >>>>> 0x7f5bc48e62df __cxa_pure_virtual
>>>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout:     @
>>>>> >>>>> 0x4dfe1d BaseClient::OnConnect()
>>>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout:     @
>>>>> >>>>> 0x4de404 EventLoopImpl::handleWriteCallback()
>>>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout:     @
>>>>> >>>>> 0x4f0a9f event_process_active_single_queue
>>>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout:     @
>>>>> >>>>> 0x4f115f event_base_loop
>>>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout:     @
>>>>> >>>>> 0x40c81f main
>>>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout:     @
>>>>> >>>>> 0x7f5bc42c6445 __libc_start_main
>>>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout:     @
>>>>> >>>>> 0x41257c (unknown)
>>>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout:     @
>>>>> >>>>>  0x0 (unknown)
>>>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout:
>>>>> >>>>> ------------
>>>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: pure
>>>>> virtual
>>>>> >>>>> method
>>>>> >>>>> called
>>>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: terminate
>>>>> called
>>>>> >>>>> without an active exception
>>>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: ***
>>>>> Aborted at
>>>>> >>>>> 1582837876 (unix time) try "date -d @1582837876" if you are
>>>>> using GNU
>>>>> >>>>> date
>>>>> >>>>> ***
>>>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: PC: @
>>>>> >>>>> 0x7fe9d8935277 __GI_raise
>>>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: *** SIGABRT
>>>>> >>>>> (@0xbcb0005176a) received by PID 333674 (TID 0x7fe9d9b1c780)
>>>>> from PID
>>>>> >>>>> 333674; stack trace: ***
>>>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout:     @
>>>>> >>>>> 0x7fe9d97026d0 (unknown)
>>>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout:     @
>>>>> >>>>> 0x7fe9d8935277 __GI_raise
>>>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout:     @
>>>>> >>>>> 0x7fe9d8936968 __GI_abort
>>>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout:     @
>>>>> >>>>> 0x7fe9d8f427d5 __gnu_cxx::__verbose_terminate_handler()
>>>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout:     @
>>>>> >>>>> 0x7fe9d8f40746 (unknown)
>>>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout:     @
>>>>> >>>>> 0x7fe9d8f40773 std::terminate()
>>>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout:     @
>>>>> >>>>> 0x7fe9d8f412df __cxa_pure_virtual
>>>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout:     @
>>>>> >>>>> 0x4df98a BaseClient::Start_Base()
>>>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout:     @
>>>>> >>>>> 0x443dfa heron::stmgr::StMgrClient::OnReConnectTimer()
>>>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout:     @
>>>>> >>>>> 0x4df6ed
>>>>> >>>>>
>>>>> >>>>>
>>>>> _ZNSt17_Function_handlerIFvN9EventLoop6StatusEEZN10BaseClient13AddTimer_BaseESt8functionIFvvEElEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
>>>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout:     @
>>>>> >>>>> 0x4ded81 EventLoopImpl::handleTimerCallback()
>>>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout:     @
>>>>> >>>>> 0x4dc840 EventLoopImpl::eventLoopImplTimerCallback()
>>>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout:     @
>>>>> >>>>> 0x4f0a9f event_process_active_single_queue
>>>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout:     @
>>>>> >>>>> 0x4f115f event_base_loop
>>>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout:     @
>>>>> >>>>> 0x40c81f main
>>>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout:     @
>>>>> >>>>> 0x7fe9d8921445 __libc_start_main
>>>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout:     @
>>>>> >>>>> 0x41257c (unknown)
>>>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout:     @
>>>>> >>>>>  0x0 (unknown)
>>>>> >>>>> [2020-02-27 21:11:17 +0000] [INFO]: stmgr-233 stdout:
>>>>> >>>>> --------------------
>>>>> >>>>> I0227 21:11:16.813194 333674 stmgr-client.cpp:139] Will try to
>>>>> >>>>> reconnect
>>>>> >>>>> again after 1 seconds
>>>>> >>>>> E0227 21:11:16.813606 333674 baseconnection.cpp:142] BufferEvent
>>>>> >>>>> reported
>>>>> >>>>> error on connection 0x2d2b720
>>>>> >>>>> I0227 21:11:16.813617 333674 stmgr-client.cpp:133] Stmgr
>>>>> stmgr-154
>>>>> >>>>> running
>>>>> >>>>> at xxxxxxx:31831 closed connection with code Write error
>>>>> >>>>> I0227 21:11:16.813670 333674 stmgr-client.cpp:139] Will try to
>>>>> >>>>> reconnect
>>>>> >>>>> again after 1 seconds
>>>>> >>>>> E0227 21:11:16.814357 333674 baseconnection.cpp:142] BufferEvent
>>>>> >>>>> reported
>>>>> >>>>> error on connection 0x2d3aa00
>>>>> >>>>> I0227 21:11:16.814375 333674 stmgr-server.cpp:111] StMgrServer
>>>>> Got
>>>>> >>>>> connection close of 0x2d3aa00 from yyyyyyyy:35270
>>>>> >>>>> I0227 21:11:16.814378 333674 stmgr-server.cpp:121] Stmgr stmgr-86
>>>>> >>>>> closed
>>>>> >>>>> connection
>>>>> >>>>> E0227 21:11:16.814669 333674 baseconnection.cpp:142] BufferEvent
>>>>> >>>>> reported
>>>>> >>>>> error on connection 0x2d3a220
>>>>> >>>>> I0227 21:11:16.814683 333674 stmgr-server.cpp:111] StMgrServer
>>>>> Got
>>>>> >>>>> connection close of 0x2d3a220 from yyyyyyyy:33974
>>>>> >>>>> I0227 21:11:16.814685 333674 stmgr-server.cpp:121] Stmgr
>>>>> stmgr-154
>>>>> >>>>> closed
>>>>> >>>>> connection
>>>>> >>>>> E0227 21:11:16.816232 333674 baseconnection.cpp:142] BufferEvent
>>>>> >>>>> reported
>>>>> >>>>> error on connection 0x56ab800
>>>>> >>>>> I0227 21:11:16.816256 333674 stmgr-client.cpp:133] Stmgr
>>>>> stmgr-221
>>>>> >>>>> running
>>>>> >>>>> at xxxxxxx:31294 closed connection with code Write error
>>>>> >>>>> I0227 21:11:16.816263 333674 stmgr-client.cpp:139] Will try to
>>>>> >>>>> reconnect
>>>>> >>>>> again after 1 seconds
>>>>> >>>>> E0227 21:11:16.817833 333674 baseconnection.cpp:142] BufferEvent
>>>>> >>>>> reported
>>>>> >>>>> error on connection 0x2d295e0
>>>>> >>>>> I0227 21:11:16.817849 333674 stmgr-client.cpp:133] Stmgr stmgr-12
>>>>> >>>>> running
>>>>> >>>>> at xxxxxxx:31846 closed connection with code Write error
>>>>> >>>>> I0227 21:11:16.817853 333674 stmgr-client.cpp:139] Will try to
>>>>> >>>>> reconnect
>>>>> >>>>> again after 1 seconds
>>>>> >>>>> E0227 21:11:16.820160 333674 baseconnection.cpp:142] BufferEvent
>>>>> >>>>> reported
>>>>> >>>>> error on connection 0x2d39500
>>>>> >>>>> I0227 21:11:16.820192 333674 stmgr-server.cpp:111] StMgrServer
>>>>> Got
>>>>> >>>>> connection close of 0x2d39500 from yyyyyyyy:41220
>>>>> >>>>> I0227 21:11:16.820196 333674 stmgr-server.cpp:121] Stmgr stmgr-12
>>>>> >>>>> closed
>>>>> >>>>> connection
>>>>> >>>>>
>>>>> >>>> --
>>>>> >>> Thanks
>>>>> >>> Xiaoyao
>>>>> >>>
>>>>> >>
>>>>> >
>>>>> > --
>>>>> > Thanks
>>>>> > Xiaoyao
>>>>> >
>>>>>
>>>> --
>>> Best regards,
>>> Dmitry Rusakov
>>>
>>
