Cannot reproduce on my local laptop. More investigation is needed into how to reproduce it.

On Sat, Mar 7, 2020 at 3:06 PM Dmitry Rusakov <[email protected]> wrote:

> Could you please try to run a job locally and manually kill one container?
> Does it reproduce the problem on your laptop?
>
> On Sat, Mar 7, 2020 at 2:12 PM Huijun Wu <[email protected]> wrote:
>
>> Observed that restarting a single container leads to all stmgrs failing
>> and restarting. Thus killing a single stmgr should reproduce this issue.
>>
>> On Tue, Mar 3, 2020 at 10:58 AM Ning Wang <[email protected]> wrote:
>>
>>> Ok. The log shows that the connection is being closed. So it just looks
>>> like a network issue causing the connection to be unstable.
>>>
>>> Looks like a race condition to me if this theory is correct. You may need
>>> to review the object and add a mutex around the critical sections.
>>>
>>> On Tue, Mar 3, 2020 at 10:07 AM Xiaoyao Qian <[email protected]> wrote:
>>>
>>> > During normal running state. And it happens in only one data center.
>>> >
>>> > On Tue, Mar 3, 2020 at 9:33 AM Ning Wang <[email protected]> wrote:
>>> >
>>> >> Yeah. A segfault is more likely in this case. But maybe not for the
>>> >> object itself but for the vtable pointer.
>>> >>
>>> >> A pure virtual function call might be possible though, I think. It all
>>> >> depends on the 4 bytes in the address. It could happen if a new object
>>> >> of a different type has been created in the same place or an object is
>>> >> moved to the memory block.
>>> >>
>>> >> More investigation is necessary.
>>> >>
>>> >> When does it happen? In the normal running state, or when the topology
>>> >> or instance is shutting down?
>>> >>
>>> >> On Tue, Mar 3, 2020 at 8:34 AM Xiaoyao Qian <[email protected]> wrote:
>>> >>
>>> >>> I agree with the second possibility, but wouldn’t it cause a segfault
>>> >>> if the object has been deleted but the method is still invoked?
>>> >>>
>>> >>> On Tue, Mar 3, 2020 at 01:16 Ning Wang <[email protected]> wrote:
>>> >>>
>>> >>>> Looks like the exception is caused by a pure virtual function call.
>>> >>>>
>>> >>>> Both exceptions are from BaseClient. It seems like the Client object
>>> >>>> doesn't have the virtual functions implemented, which is not expected.
>>> >>>>
>>> >>>> Another possibility is that the client object has been deleted, hence
>>> >>>> the vtable is not valid any more. This could be something you can check,
>>> >>>> given the last log shows "Stmgr stmgr-12 closed connection".
>>> >>>>
>>> >>>> On Mon, Mar 2, 2020 at 12:20 PM Huijun Wu <[email protected]> wrote:
>>> >>>>
>>> >>>>> Hi,
>>> >>>>>
>>> >>>>> We observed the Stmgr coredump; see the logs below. Does anybody have
>>> >>>>> any idea about this? Thanks.
>>> >>>>>
>>> >>>>> Best,
>>> >>>>> Huijun
>>> >>>>>
>>> >>>>> ------------
>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout: pure virtual method called
>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout: terminate called without an active exception
>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout: *** Aborted at 1582847383 (unix time) try "date -d @1582847383" if you are using GNU date ***
>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout: PC: @ 0x7f5bc42da277 __GI_raise
>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout: *** SIGABRT (@0xbcb0000b338) received by PID 45880 (TID 0x7f5bc54c1780) from PID 45880; stack trace: ***
>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout: @ 0x7f5bc50a76d0 (unknown)
>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout: @ 0x7f5bc42da277 __GI_raise
>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout: @ 0x7f5bc42db968 __GI_abort
>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout: @ 0x7f5bc48e77d5 __gnu_cxx::__verbose_terminate_handler()
>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout: @ 0x7f5bc48e5746 (unknown)
>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout: @ 0x7f5bc48e5773 std::terminate()
>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout: @ 0x7f5bc48e62df __cxa_pure_virtual
>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout: @ 0x4dfe1d BaseClient::OnConnect()
>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout: @ 0x4de404 EventLoopImpl::handleWriteCallback()
>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout: @ 0x4f0a9f event_process_active_single_queue
>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout: @ 0x4f115f event_base_loop
>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout: @ 0x40c81f main
>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout: @ 0x7f5bc42c6445 __libc_start_main
>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout: @ 0x41257c (unknown)
>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout: @ 0x0 (unknown)
>>> >>>>> [2020-02-27 23:49:43 +0000] [INFO]: stmgr-233 stdout:
>>> >>>>> ------------
>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: pure virtual method called
>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: terminate called without an active exception
>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: *** Aborted at 1582837876 (unix time) try "date -d @1582837876" if you are using GNU date ***
>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: PC: @ 0x7fe9d8935277 __GI_raise
>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: *** SIGABRT (@0xbcb0005176a) received by PID 333674 (TID 0x7fe9d9b1c780) from PID 333674; stack trace: ***
>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: @ 0x7fe9d97026d0 (unknown)
>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: @ 0x7fe9d8935277 __GI_raise
>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: @ 0x7fe9d8936968 __GI_abort
>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: @ 0x7fe9d8f427d5 __gnu_cxx::__verbose_terminate_handler()
>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: @ 0x7fe9d8f40746 (unknown)
>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: @ 0x7fe9d8f40773 std::terminate()
>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: @ 0x7fe9d8f412df __cxa_pure_virtual
>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: @ 0x4df98a BaseClient::Start_Base()
>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: @ 0x443dfa heron::stmgr::StMgrClient::OnReConnectTimer()
>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: @ 0x4df6ed _ZNSt17_Function_handlerIFvN9EventLoop6StatusEEZN10BaseClient13AddTimer_BaseESt8functionIFvvEElEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: @ 0x4ded81 EventLoopImpl::handleTimerCallback()
>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: @ 0x4dc840 EventLoopImpl::eventLoopImplTimerCallback()
>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: @ 0x4f0a9f event_process_active_single_queue
>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: @ 0x4f115f event_base_loop
>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: @ 0x40c81f main
>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: @ 0x7fe9d8921445 __libc_start_main
>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: @ 0x41257c (unknown)
>>> >>>>> [2020-02-27 21:11:16 +0000] [INFO]: stmgr-233 stdout: @ 0x0 (unknown)
>>> >>>>> [2020-02-27 21:11:17 +0000] [INFO]: stmgr-233 stdout:
>>> >>>>> --------------------
>>> >>>>> I0227 21:11:16.813194 333674 stmgr-client.cpp:139] Will try to reconnect again after 1 seconds
>>> >>>>> E0227 21:11:16.813606 333674 baseconnection.cpp:142] BufferEvent reported error on connection 0x2d2b720
>>> >>>>> I0227 21:11:16.813617 333674 stmgr-client.cpp:133] Stmgr stmgr-154 running at xxxxxxx:31831 closed connection with code Write error
>>> >>>>> I0227 21:11:16.813670 333674 stmgr-client.cpp:139] Will try to reconnect again after 1 seconds
>>> >>>>> E0227 21:11:16.814357 333674 baseconnection.cpp:142] BufferEvent reported error on connection 0x2d3aa00
>>> >>>>> I0227 21:11:16.814375 333674 stmgr-server.cpp:111] StMgrServer Got connection close of 0x2d3aa00 from yyyyyyyy:35270
>>> >>>>> I0227 21:11:16.814378 333674 stmgr-server.cpp:121] Stmgr stmgr-86 closed connection
>>> >>>>> E0227 21:11:16.814669 333674 baseconnection.cpp:142] BufferEvent reported error on connection 0x2d3a220
>>> >>>>> I0227 21:11:16.814683 333674 stmgr-server.cpp:111] StMgrServer Got connection close of 0x2d3a220 from yyyyyyyy:33974
>>> >>>>> I0227 21:11:16.814685 333674 stmgr-server.cpp:121] Stmgr stmgr-154 closed connection
>>> >>>>> E0227 21:11:16.816232 333674 baseconnection.cpp:142] BufferEvent reported error on connection 0x56ab800
>>> >>>>> I0227 21:11:16.816256 333674 stmgr-client.cpp:133] Stmgr stmgr-221 running at xxxxxxx:31294 closed connection with code Write error
>>> >>>>> I0227 21:11:16.816263 333674 stmgr-client.cpp:139] Will try to reconnect again after 1 seconds
>>> >>>>> E0227 21:11:16.817833 333674 baseconnection.cpp:142] BufferEvent reported error on connection 0x2d295e0
>>> >>>>> I0227 21:11:16.817849 333674 stmgr-client.cpp:133] Stmgr stmgr-12 running at xxxxxxx:31846 closed connection with code Write error
>>> >>>>> I0227 21:11:16.817853 333674 stmgr-client.cpp:139] Will try to reconnect again after 1 seconds
>>> >>>>> E0227 21:11:16.820160 333674 baseconnection.cpp:142] BufferEvent reported error on connection 0x2d39500
>>> >>>>> I0227 21:11:16.820192 333674 stmgr-server.cpp:111] StMgrServer Got connection close of 0x2d39500 from yyyyyyyy:41220
>>> >>>>> I0227 21:11:16.820196 333674 stmgr-server.cpp:121] Stmgr stmgr-12 closed connection
>>> >>>>>
>>> >>>> --
>>> >>> Thanks
>>> >>> Xiaoyao
>>> >>>
>>> >>
>>> >
>>> > --
>>> > Thanks
>>> > Xiaoyao
>>> >
>>>
>> --
> Best regards,
> Dmitry Rusakov
>
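
A note on the question raised in the thread, namely why the process reports "pure virtual method called" rather than a segfault. While a base-class destructor runs, the object's dynamic type is already the base class, so virtual calls dispatch to the base's entries. If such an entry is pure, the GCC runtime routes the call to __cxa_pure_virtual, which is the frame that appears in both traces. The sketch below illustrates only that dispatch rule. Base, Derived, Notify, HandleConnect, and RunDemo are hypothetical names, not Heron's classes, and the base method is deliberately non-pure so that the program records the call instead of aborting.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a client base class (NOT Heron's BaseClient).
class Base {
 public:
  explicit Base(std::vector<std::string>* log) : log_(log) {}

  // By the time ~Base runs, the Derived part is already destroyed, so the
  // virtual call made through Notify() resolves to Base::HandleConnect.
  virtual ~Base() { Notify(); }

  // Non-virtual entry point, similar in spirit to a callback invoked by an
  // event loop.
  void Notify() { HandleConnect(); }

 protected:
  // Assumption for the demo: non-pure, so the call is recorded. If this were
  // declared "= 0", the call made during ~Base would abort the process.
  virtual void HandleConnect() { log_->push_back("Base::HandleConnect"); }

  std::vector<std::string>* log_;
};

// Hypothetical concrete client.
class Derived : public Base {
 public:
  explicit Derived(std::vector<std::string>* log) : Base(log) {}

 protected:
  void HandleConnect() override { log_->push_back("Derived::HandleConnect"); }
};

// Returns the sequence of HandleConnect implementations that were reached.
std::vector<std::string> RunDemo() {
  std::vector<std::string> log;
  {
    Derived d(&log);
    d.Notify();  // Object fully alive: dispatches to Derived.
  }              // Scope exit: ~Derived, then ~Base, which dispatches to Base.
  return log;
}
```

RunDemo() returns the calls in order: first the derived override, then the base version reached from the destructor. Whether the stmgr crash follows this pattern (for example, a timer or write callback firing on a client that is being torn down) is an assumption that still needs to be confirmed against the Heron code.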
