Put up a PR: https://github.com/apache/incubator-heron/pull/3492
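
For anyone reading the thread below: both traces reach `__cxa_pure_virtual` inside `BaseClient` from an event-loop callback (a write callback and the reconnect timer), and the working theory in the thread is that the client object had already been deleted. Below is a minimal sketch, assuming that theory, of one common way to stop a pending callback from touching a destroyed object: capture a `std::weak_ptr` and check it before dispatching. This is only an illustration and not necessarily what the PR does; `ToyEventLoop`, `ToyClient`, and `RunDemo` are hypothetical names, not Heron APIs.

```cpp
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for an event loop: stores callbacks and runs them later.
class ToyEventLoop {
 public:
  void AddTimer(std::function<void()> cb) { timers_.push_back(std::move(cb)); }

  // Runs and clears every pending callback; returns how many were invoked.
  int RunPending() {
    std::vector<std::function<void()>> pending;
    pending.swap(timers_);
    for (auto& cb : pending) cb();
    return static_cast<int>(pending.size());
  }

 private:
  std::vector<std::function<void()>> timers_;
};

// Hypothetical client that schedules a reconnect timer on the loop.
class ToyClient : public std::enable_shared_from_this<ToyClient> {
 public:
  explicit ToyClient(int* reconnects) : reconnects_(reconnects) {}

  void ScheduleReconnect(ToyEventLoop& loop) {
    // Capture a weak_ptr instead of `this`, so the callback can detect
    // that the client was destroyed before the timer fired.
    std::weak_ptr<ToyClient> weak = shared_from_this();
    loop.AddTimer([weak]() {
      if (auto self = weak.lock()) {
        self->OnReconnectTimer();  // client still alive: safe to call
      }
      // Otherwise the client is gone and the callback does nothing.
    });
  }

 private:
  void OnReconnectTimer() { ++*reconnects_; }
  int* reconnects_;  // counter owned by the caller, outlives the client
};

// Schedules two timers; the client is destroyed before the second one fires.
// Returns the number of reconnect attempts that actually ran.
int RunDemo() {
  int reconnects = 0;
  ToyEventLoop loop;
  auto client = std::make_shared<ToyClient>(&reconnects);

  client->ScheduleReconnect(loop);
  loop.RunPending();  // client alive, callback runs

  client->ScheduleReconnect(loop);
  client.reset();     // client destroyed while the timer is still pending
  loop.RunPending();  // callback fires but finds the client gone

  return reconnects;
}
```

Calling `RunDemo()` from a `main` shows that only the first timer reaches the client. Capturing a raw `this` in the same lambda would instead call into freed memory, which is undefined behavior and can surface as "pure virtual method called".
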
On Tue, Mar 17, 2020 at 5:53 PM Huijun Wu <[email protected]> wrote:

> Found a way to reproduce this issue:
>
> 1. Run the ExclamationTopology job:
> ~/.heron/bin/heron submit local \
>   ~/.heron/examples/heron-api-examples.jar \
>   org.apache.heron.examples.api.ExclamationTopology \
>   hello-world-topology
>
> 2. Kill one of the heron-executor processes, like this:
> -----
> [tw-mbp-huijunw .herondata]$ PID=`ps -ef | grep SchedulerMain | grep java | awk '{print $2}'`
> [tw-mbp-huijunw .herondata]$ [ -z "$PID" ] && echo "topology not found" || pstree -p $PID
> -+= 00001 root /sbin/launchd
>  \-+- 86666 huijunw /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/bin/java -cp /U
>    |-+= 86672 huijunw /System/Library/Frameworks/Python.framework/Versions/2.7/Resources/Pytho
>    | |--- 86696 huijunw ./heron-core/bin/heron-tmaster --topology_name=hello-world-topology --
>    | |--- 86698 huijunw /System/Library/Frameworks/Python.framework/Versions/2.7/Resources/Pyt
>    | |--- 86701 huijunw /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/bin/java -X
>    | |--- 86703 huijunw /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/bin/java -X
>    | \--- 86707 huijunw /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/bin/java -X
> *  |-+= 86673 huijunw /System/Library/Frameworks/Python.framework/Versions/2.7/Resources/Pytho
>    | |--- 86700 huijunw /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/bin/java -X
>    | |--- 86704 huijunw ./heron-core/bin/heron-stmgr --topology_name=hello-world-topology --to
>    | |--- 86705 huijunw /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/bin/java -X
>    | |--- 86708 huijunw /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/bin/java -X
>    | |--- 86710 huijunw /System/Library/Frameworks/Python.framework/Versions/2.7/Resources/Pyt
>    | \--- 86712 huijunw /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/bin/java -X*
>    \-+= 86674 huijunw /System/Library/Frameworks/Python.framework/Versions/2.7/Resources/Pytho
>      |--- 86697 huijunw ./heron-core/bin/heron-stmgr --topology_name=hello-world-topology --to
>      |--- 86699 huijunw /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/bin/java -X
>      |--- 86702 huijunw /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/bin/java -X
>      |--- 86706 huijunw /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/bin/java -X
>      |--- 86709 huijunw /System/Library/Frameworks/Python.framework/Versions/2.7/Resources/Pyt
>      \--- 86711 huijunw /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/bin/java -X
> [tw-mbp-huijunw .herondata]$ kill 86673
> [tw-mbp-huijunw .herondata]$ [ -z "$PID" ] && echo "topology not found" || pstree -p $PID
> -+= 00001 root /sbin/launchd
>  \-+- 86666 huijunw /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/bin/java -cp /U
>    |-+= 86672 huijunw /System/Library/Frameworks/Python.framework/Versions/2.7/Resources/Pytho
>    | |--- 86696 huijunw ./heron-core/bin/heron-tmaster --topology_name=hello-world-topology --
>    | |--- 86698 huijunw /System/Library/Frameworks/Python.framework/Versions/2.7/Resources/Pyt
>    | |--- 86701 huijunw /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/bin/java -X
>    | |--- 86703 huijunw /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/bin/java -X
>    | \--- 86707 huijunw /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/bin/java -X
>    |-+= 86674 huijunw /System/Library/Frameworks/Python.framework/Versions/2.7/Resources/Pytho
>    | |--- 86697 huijunw (heron-stmgr)
>    | |--- 86699 huijunw /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/bin/java -X
>    | |--- 86702 huijunw /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/bin/java -X
>    | |--- 86706 huijunw /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/bin/java -X
>    | |--- 86709 huijunw /System/Library/Frameworks/Python.framework/Versions/2.7/Resources/Pyt
>    | \--- 86711 huijunw /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/bin/java -X
> *  \-+= 86745 huijunw /System/Library/Frameworks/Python.framework/Versions/2.7/Resources/Pytho
>      |--- 86753 huijunw /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/bin/java -X
>      |--- 86754 huijunw ./heron-core/bin/heron-stmgr --topology_name=hello-world-topology --to
>      |--- 86755 huijunw /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/bin/java -X
>      |--- 86756 huijunw /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/bin/java -X
>      |--- 86757 huijunw /System/Library/Frameworks/Python.framework/Versions/2.7/Resources/Pyt
>      \--- 86758 huijunw /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/bin/java -X*
> -----
>
> 3. The exception appeared in the 86674 heron-executor log.
>
> On Sat, Mar 7, 2020 at 11:55 PM Huijun Wu <[email protected]> wrote:
>
>> Cannot reproduce on my local laptop. Need more investigation into how to
>> reproduce it.
>>
>> On Sat, Mar 7, 2020 at 3:06 PM Dmitry Rusakov <[email protected]> wrote:
>>
>>> Could you please try to run a job locally and manually kill one
>>> container? Does it reproduce the problem on your laptop?
>>>
>>> On Sat, Mar 7, 2020 at 2:12 PM Huijun Wu <[email protected]> wrote:
>>>
>>>> Observed that restarting a single container leads to all stmgrs
>>>> failing and restarting. Thus killing a single stmgr should reproduce
>>>> this issue.
>>>>
>>>> On Tue, Mar 3, 2020 at 10:58 AM Ning Wang <[email protected]> wrote:
>>>>
>>>>> Ok. The log shows that the connection is being closed. So it just looks
>>>>> like a network issue causing the connection to be unstable.
>>>>>
>>>>> Looks like a race condition to me if this theory is correct. You may
>>>>> need to review the object and add a mutex around the critical sections.
>>>>>
>>>>> On Tue, Mar 3, 2020 at 10:07 AM Xiaoyao Qian <[email protected]> wrote:
>>>>>
>>>>> > During normal running state.
>>>>> > And it happens only in one data center.
>>>>> >
>>>>> > On Tue, Mar 3, 2020 at 9:33 AM Ning Wang <[email protected]> wrote:
>>>>> >
>>>>> >> Yeah. Segfault is more likely in this case. But maybe not for the
>>>>> >> object itself but for the vtable pointer.
>>>>> >>
>>>>> >> Pure function call might be possible though I think. It all depends
>>>>> >> on the 4 bytes in the address. It could happen if a new object of a
>>>>> >> different type has been created in the place or an object is moved to
>>>>> >> the memory block.
>>>>> >>
>>>>> >> More investigation is necessary.
>>>>> >>
>>>>> >> When does it happen? Normal running state or when topo or instance is
>>>>> >> shutting down?
>>>>> >>
>>>>> >> On Tue, Mar 3, 2020 at 8:34 AM Xiaoyao Qian <[email protected]> wrote:
>>>>> >>
>>>>> >>> I agree with the second possibility, but wouldn’t it cause a segfault
>>>>> >>> if the object has been deleted but the method is still invoked?
>>>>> >>>
>>>>> >>> On Tue, Mar 3, 2020 at 01:16 Ning Wang <[email protected]> wrote:
>>>>> >>>
>>>>> >>>> Looks like the exception is caused by a pure virtual function call.
>>>>> >>>>
>>>>> >>>> Both exceptions are from BaseClient. It seems like the Client object
>>>>> >>>> doesn't have the virtual functions implemented, which is not expected.
>>>>> >>>>
>>>>> >>>> Another possibility is that the client object has been deleted, hence
>>>>> >>>> the vtable is not valid any more. This could be something you can check
>>>>> >>>> given the last log shows "Stmgr stmgr-12 closed connection".
>>>>> >>>>
>>>>> >>>> On Mon, Mar 2, 2020 at 12:20 PM Huijun Wu <[email protected]> wrote:
>>>>> >>>>
>>>>> >>>>> Hi,
>>>>> >>>>>
>>>>> >>>>> We observed the Stmgr coredump, see the logs below. Does anybody
>>>>> >>>>> have any idea about this? Thanks.
>>>>> >>>>>
>>>>> >>>>> Best,
>>>>> >>>>> Huijun
>>>>> >>>>>
>>>>> >>>>> ------------
>>>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout: pure virtual method called
>>>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout: terminate called without an active exception
>>>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout: *** Aborted at 1582847383 (unix time) try "date -d @1582847383" if you are using GNU date ***
>>>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout: PC: @ 0x7f5bc42da277 __GI_raise
>>>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout: *** SIGABRT (@0xbcb0000b338) received by PID 45880 (TID 0x7f5bc54c1780) from PID 45880; stack trace: ***
>>>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout: @ 0x7f5bc50a76d0 (unknown)
>>>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout: @ 0x7f5bc42da277 __GI_raise
>>>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout: @ 0x7f5bc42db968 __GI_abort
>>>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout: @ 0x7f5bc48e77d5 __gnu_cxx::__verbose_terminate_handler()
>>>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout: @ 0x7f5bc48e5746 (unknown)
>>>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout: @ 0x7f5bc48e5773 std::terminate()
>>>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout: @ 0x7f5bc48e62df __cxa_pure_virtual
>>>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout: @ 0x4dfe1d BaseClient::OnConnect()
>>>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout: @ 0x4de404 EventLoopImpl::handleWriteCallback()
>>>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout: @ 0x4f0a9f event_process_active_single_queue
>>>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout: @ 0x4f115f event_base_loop
>>>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout: @ 0x40c81f main
>>>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout: @ 0x7f5bc42c6445 __libc_start_main
>>>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout: @ 0x41257c (unknown)
>>>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout: @ 0x0 (unknown)
>>>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout:
>>>>> >>>>> ------------
>>>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: pure virtual method called
>>>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: terminate called without an active exception
>>>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: *** Aborted at 1582837876 (unix time) try "date -d @1582837876" if you are using GNU date ***
>>>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: PC: @ 0x7fe9d8935277 __GI_raise
>>>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: *** SIGABRT (@0xbcb0005176a) received by PID 333674 (TID 0x7fe9d9b1c780) from PID 333674; stack trace: ***
>>>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: @ 0x7fe9d97026d0 (unknown)
>>>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: @ 0x7fe9d8935277 __GI_raise
>>>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: @ 0x7fe9d8936968 __GI_abort
>>>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: @ 0x7fe9d8f427d5 __gnu_cxx::__verbose_terminate_handler()
>>>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: @ 0x7fe9d8f40746 (unknown)
>>>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: @ 0x7fe9d8f40773 std::terminate()
>>>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: @ 0x7fe9d8f412df __cxa_pure_virtual
>>>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: @ 0x4df98a BaseClient::Start_Base()
>>>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: @ 0x443dfa heron::stmgr::StMgrClient::OnReConnectTimer()
>>>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: @ 0x4df6ed _ZNSt17_Function_handlerIFvN9EventLoop6StatusEEZN10BaseClient13AddTimer_BaseESt8functionIFvvEElEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
>>>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: @ 0x4ded81 EventLoopImpl::handleTimerCallback()
>>>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: @ 0x4dc840 EventLoopImpl::eventLoopImplTimerCallback()
>>>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: @ 0x4f0a9f event_process_active_single_queue
>>>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: @ 0x4f115f event_base_loop
>>>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: @ 0x40c81f main
>>>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: @ 0x7fe9d8921445 __libc_start_main
>>>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: @ 0x41257c (unknown)
>>>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: @ 0x0 (unknown)
>>>>> >>>>> [2020-02-27 21:11:17 +0000] [INFO]: stmgr-233 stdout:
>>>>> >>>>> --------------------
>>>>> >>>>> I0227 21:11:16.813194 333674 stmgr-client.cpp:139] Will try to reconnect again after 1 seconds
>>>>> >>>>> E0227 21:11:16.813606 333674 baseconnection.cpp:142] BufferEvent reported error on connection 0x2d2b720
>>>>> >>>>> I0227 21:11:16.813617 333674 stmgr-client.cpp:133] Stmgr stmgr-154 running at xxxxxxx:31831 closed connection with code Write error
>>>>> >>>>> I0227 21:11:16.813670 333674 stmgr-client.cpp:139] Will try to reconnect again after 1 seconds
>>>>> >>>>> E0227 21:11:16.814357 333674 baseconnection.cpp:142] BufferEvent reported error on connection 0x2d3aa00
>>>>> >>>>> I0227 21:11:16.814375 333674 stmgr-server.cpp:111] StMgrServer Got connection close of 0x2d3aa00 from yyyyyyyy:35270
>>>>> >>>>> I0227 21:11:16.814378 333674 stmgr-server.cpp:121] Stmgr stmgr-86 closed connection
>>>>> >>>>> E0227 21:11:16.814669 333674 baseconnection.cpp:142] BufferEvent reported error on connection 0x2d3a220
>>>>> >>>>> I0227 21:11:16.814683 333674 stmgr-server.cpp:111] StMgrServer Got connection close of 0x2d3a220 from yyyyyyyy:33974
>>>>> >>>>> I0227 21:11:16.814685 333674 stmgr-server.cpp:121] Stmgr stmgr-154 closed connection
>>>>> >>>>> E0227 21:11:16.816232 333674 baseconnection.cpp:142] BufferEvent reported error on connection 0x56ab800
>>>>> >>>>> I0227 21:11:16.816256 333674 stmgr-client.cpp:133] Stmgr stmgr-221 running at xxxxxxx:31294 closed connection with code Write error
>>>>> >>>>> I0227 21:11:16.816263 333674 stmgr-client.cpp:139] Will try to reconnect again after 1 seconds
>>>>> >>>>> E0227 21:11:16.817833 333674 baseconnection.cpp:142] BufferEvent reported error on connection 0x2d295e0
>>>>> >>>>> I0227 21:11:16.817849 333674 stmgr-client.cpp:133] Stmgr stmgr-12 running at xxxxxxx:31846 closed connection with code Write error
>>>>> >>>>> I0227 21:11:16.817853 333674 stmgr-client.cpp:139] Will try to reconnect again after 1 seconds
>>>>> >>>>> E0227 21:11:16.820160 333674 baseconnection.cpp:142] BufferEvent reported error on connection 0x2d39500
>>>>> >>>>> I0227 21:11:16.820192 333674 stmgr-server.cpp:111] StMgrServer Got connection close of 0x2d39500 from yyyyyyyy:41220
>>>>> >>>>> I0227 21:11:16.820196 333674 stmgr-server.cpp:121] Stmgr stmgr-12 closed connection
>>>>> >>>>>
>>>>> >>>> --
>>>>> >>> Thanks
>>>>> >>> Xiaoyao
>>>>> >>>
>>>>> >>
>>>>> >
>>>>> > --
>>>>> > Thanks
>>>>> > Xiaoyao
>>>>> >
>>>>>
>>>> --
>>> Best regards,
>>> Dmitry Rusakov
>>>
>>
