[ 
https://issues.apache.org/jira/browse/KUDU-2148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-2148:
--------------------------------
    Description: 
When running the {{kudu}} CLI tool (particularly, {{kudu master status}} or 
{{kudu tserver status}}) against a master or a tablet server which is starting 
up, the starting process crashes when the RPC request arrives after the RPC 
server starts but before the embedded web server starts.

Below is an example of tserver's log captured for the scenario described above:

{noformat}
I0919 05:20:12.843554 19694 tablet_server_main.cc:72] Tablet server non-default 
flags:
--raft_heartbeat_interval_ms=50
--fs_data_dirs=/var/lib/kudu/tserver
--fs_wal_dir=/var/lib/kudu/tserver
--tserver_master_addrs=127.0.0.1:7051
--heap_profile_path=/tmp/kudu-tserver.19694
--unlock_experimental_flags=true
--unlock_unsafe_flags=true
--flagfile=/etc/kudu/conf/tserver.gflagfile
--log_dir=/var/log/kudu
--logtostderr=true
Tablet server version:
kudu 1.6.0-SNAPSHOT
revision 052d1def8c39a4e50f4957a95ae19764deb30938
build type DEBUG
built by jenkins at 19 Sep 2017 03:59:34 PST on my-node
build id 2017-09-19_03-59-27
I0919 05:20:12.844458 19694 minidump.cc:237] Setting minidump size limit to 20M
I0919 05:20:12.916566 19694 tablet_server_main.cc:79] Initializing tablet 
server...
I0919 05:20:12.969810 19694 system_ntp.cc:104] NTP initialized. Resolution in 
nanos?: 1 Skew PPM: 500 Current error: 47022
I0919 05:20:12.969980 19694 server_base.cc:228] Could not load existing FS 
layout: Not found: /var/lib/kudu/tserver/instance: No such file or directory 
(error 2)
I0919 05:20:12.969992 19694 server_base.cc:229] Creating new FS layout
I0919 05:20:12.985256 19694 fs_manager.cc:474] Generated new instance metadata 
in path /var/lib/kudu/tserver/instance:
uuid: "89b8ccc42dfb4d39a83d801ceb1ff0a1"
format_stamp: "Formatted at 2017-09-19 12:20:12 on my-node"
I0919 05:20:12.993276 19694 fs_manager.cc:431] Time spent creating directory 
manager: real 0.006s       user 0.000s     sys 0.000s
I0919 05:20:12.993454 19694 env_posix.cc:1589] Not raising process file limit 
of 65536; it is already as high as it can go
I0919 05:20:12.993530 19694 file_cache.cc:469] Constructed file cache lbm with 
capacity 26214
I0919 05:20:12.995923 19694 fs_manager.cc:335] Time spent opening block 
manager: real 0.000s    user 0.000s     sys 0.000s
I0919 05:20:12.995976 19694 fs_manager.cc:348] Opened local filesystem: 
/var/lib/kudu/tserver
uuid: "89b8ccc42dfb4d39a83d801ceb1ff0a1"
format_stamp: "Formatted at 2017-09-19 12:20:12 on my-node"
I0919 05:20:12.996026 19694 fs_report.cc:347] Block manager report
--------------------
1 data directories: /var/lib/kudu/tserver/data
Total live blocks: 0
Total live bytes: 0
Total live bytes (after alignment): 0
Total number of LBM containers: 0 (0 full)
Did not check for missing blocks
Did not check for orphaned blocks
Total full LBM containers with extra space: 0 (0 repaired)
Total full LBM container extra space in bytes: 0 (0 repaired)
Total incomplete LBM containers: 0 (0 repaired)
Total LBM partial records: 0 (0 repaired)
I0919 05:20:13.107158 19694 ts_tablet_manager.cc:216] Loaded tablet metadata (0 
live tablets)
I0919 05:20:13.107344 19694 tablet_server_main.cc:84] Starting tablet server...
I0919 05:20:13.149742 19694 rpc_server.cc:192] RPC server started. Bound to: 
0.0.0.0:7050
I0919 05:20:13.569461 19694 webserver.cc:149] Starting webserver on 0.0.0.0:8050
I0919 05:20:13.569495 19694 webserver.cc:160] Document root disabled
I0919 05:20:13.569829 19694 webserver.cc:282] Webserver started. Bound to: 
http://0.0.0.0:8050/
F0919 05:20:13.158834 19831 server_base.cc:316] Check failed: _s.ok() Bad 
status: Illegal state: Not started
*** Check failure stack trace: ***
I0919 05:20:13.589279 19694 tablet_server_main.cc:87] Tablet server 
successfully started.
I0919 05:20:13.570534 19842 process_memory.cc:181] Process hard memory limit is 
0.464404 GB
I0919 05:20:13.589357 19842 process_memory.cc:183] Process soft memory limit is 
0.371523 GB
I0919 05:20:13.589368 19842 process_memory.cc:186] Process memory pressure 
threshold is 0.278643 GB
Wrote minidump to 
/var/log/kudu/minidumps/kudu-tserver/1b23130b-f2d7-4724-7eb86aa5-72cfcc16.dmp
Wrote minidump to 
/var/log/kudu/minidumps/kudu-tserver/1b23130b-f2d7-4724-7eb86aa5-72cfcc16.dmp
*** Aborted at 1505823613 (unix time) try "date -d @1505823613" if you are 
using GNU date ***
PC: @     0x7ff5ea9b4067 gsignal
*** SIGABRT (@0x3e700004cee) received by PID 19694 (TID 0x7ff5adfe0700) from 
PID 19694; stack trace: ***
    @     0x7ff5ec6ce8d0 (unknown)
    @     0x7ff5ea9b4067 gsignal
    @     0x7ff5ea9b5448 abort
    @          0x2626d1b kudu::AbortFailureFunction()
    @          0x112f26d google::LogMessage::Fail()
W0919 05:20:13.640193 19841 heartbeater.cc:524] Failed to heartbeat to 
127.0.0.1:7051: Network error: Failed to ping master at 127.0.0.1:7051: Client 
connection negotiation failed: client connection to 127.0.0.1:7051: connect: 
Connection refused (error 111)
    @          0x113113d google::LogMessage::SendToLog()
    @          0x112eda9 google::LogMessage::Flush()
    @          0x1131bdf google::LogMessageFatal::~LogMessageFatal()
    @          0x11d88ba kudu::server::ServerBase::GetStatusPB()
    @          0x12034fc kudu::server::GenericServiceImpl::GetStatus()
    @          0x121f062 
_ZZN4kudu6server16GenericServiceIfC4ERK13scoped_refptrINS_12MetricEntityEERKS2_INS_3rpc13ResultTrackerEEENKUlPKN6google8protobuf7MessageEPSE_PNS7_10RpcContextEE10_clESG_SH_SJ_
    @          0x1221cf9 
_ZNSt17_Function_handlerIFvPKN6google8protobuf7MessageEPS2_PN4kudu3rpc10RpcContextEEZNS6_6server16GenericServiceIfC4ERK13scoped_refptrINS6_12MetricEntityEERKSD_INS7_13ResultTrackerEEEUlS4_S5_S9_E10_E9_M_invokeERKSt9_Any_dataS4_S5_S9_
    @          0x24a902a std::function<>::operator()()
    @          0x24a888b kudu::rpc::GeneratedServiceIf::Handle()
    @          0x24aa9b4 kudu::rpc::ServicePool::RunThread()
    @          0x24aba2d boost::_mfi::mf0<>::operator()()
    @          0x24ab85c boost::_bi::list1<>::operator()<>()
    @          0x24ab751 boost::_bi::bind_t<>::operator()()
    @          0x24ab687 
boost::detail::function::void_function_obj_invoker0<>::invoke()
    @          0x24cb570 boost::function0<>::operator()()
    @          0x265cb52 kudu::Thread::SuperviseThread()
    @     0x7ff5ec6c70a4 start_thread
    @     0x7ff5eaa6787d clone
    @                0x0 (unknown)
{noformat}

  was:
When running the {{kudu}} CLI tool (particularly, {{kudu master status}} or 
{{kudu tserver status}}) against a master or a tablet server which is starting 
up, the staring process crashes when the RPC request arrives after the RPC 
server starts but before the embedded web server starts.

Below is an example of tserver's log from captured for the scenario described 
above:

{noformat}
I0919 05:20:12.843554 19694 tablet_server_main.cc:72] Tablet server non-default 
flags:
--raft_heartbeat_interval_ms=50
--fs_data_dirs=/var/lib/kudu/tserver
--fs_wal_dir=/var/lib/kudu/tserver
--tserver_master_addrs=127.0.0.1:7051
--heap_profile_path=/tmp/kudu-tserver.19694
--unlock_experimental_flags=true
--unlock_unsafe_flags=true
--flagfile=/etc/kudu/conf/tserver.gflagfile
--log_dir=/var/log/kudu
--logtostderr=true
Tablet server version:
kudu 1.6.0-SNAPSHOT
revision 052d1def8c39a4e50f4957a95ae19764deb30938
build type DEBUG
built by jenkins at 19 Sep 2017 03:59:34 PST on my-node
build id 2017-09-19_03-59-27
I0919 05:20:12.844458 19694 minidump.cc:237] Setting minidump size limit to 20M
I0919 05:20:12.916566 19694 tablet_server_main.cc:79] Initializing tablet 
server...
I0919 05:20:12.969810 19694 system_ntp.cc:104] NTP initialized. Resolution in 
nanos?: 1 Skew PPM: 500 Current error: 47022
I0919 05:20:12.969980 19694 server_base.cc:228] Could not load existing FS 
layout: Not found: /var/lib/kudu/tserver/instance: No such file or directory 
(error 2)
I0919 05:20:12.969992 19694 server_base.cc:229] Creating new FS layout
I0919 05:20:12.985256 19694 fs_manager.cc:474] Generated new instance metadata 
in path /var/lib/kudu/tserver/instance:
uuid: "89b8ccc42dfb4d39a83d801ceb1ff0a1"
format_stamp: "Formatted at 2017-09-19 12:20:12 on my-node"
I0919 05:20:12.993276 19694 fs_manager.cc:431] Time spent creating directory 
manager: real 0.006s       user 0.000s     sys 0.000s
I0919 05:20:12.993454 19694 env_posix.cc:1589] Not raising process file limit 
of 65536; it is already as high as it can go
I0919 05:20:12.993530 19694 file_cache.cc:469] Constructed file cache lbm with 
capacity 26214
I0919 05:20:12.995923 19694 fs_manager.cc:335] Time spent opening block 
manager: real 0.000s    user 0.000s     sys 0.000s
I0919 05:20:12.995976 19694 fs_manager.cc:348] Opened local filesystem: 
/var/lib/kudu/tserver
uuid: "89b8ccc42dfb4d39a83d801ceb1ff0a1"
format_stamp: "Formatted at 2017-09-19 12:20:12 on my-node"
I0919 05:20:12.996026 19694 fs_report.cc:347] Block manager report
--------------------
1 data directories: /var/lib/kudu/tserver/data
Total live blocks: 0
Total live bytes: 0
Total live bytes (after alignment): 0
Total number of LBM containers: 0 (0 full)
Did not check for missing blocks
Did not check for orphaned blocks
Total full LBM containers with extra space: 0 (0 repaired)
Total full LBM container extra space in bytes: 0 (0 repaired)
Total incomplete LBM containers: 0 (0 repaired)
Total LBM partial records: 0 (0 repaired)
I0919 05:20:13.107158 19694 ts_tablet_manager.cc:216] Loaded tablet metadata (0 
live tablets)
I0919 05:20:13.107344 19694 tablet_server_main.cc:84] Starting tablet server...
I0919 05:20:13.149742 19694 rpc_server.cc:192] RPC server started. Bound to: 
0.0.0.0:7050
I0919 05:20:13.569461 19694 webserver.cc:149] Starting webserver on 0.0.0.0:8050
I0919 05:20:13.569495 19694 webserver.cc:160] Document root disabled
I0919 05:20:13.569829 19694 webserver.cc:282] Webserver started. Bound to: 
http://0.0.0.0:8050/
F0919 05:20:13.158834 19831 server_base.cc:316] Check failed: _s.ok() Bad 
status: Illegal state: Not started
*** Check failure stack trace: ***
I0919 05:20:13.589279 19694 tablet_server_main.cc:87] Tablet server 
successfully started.
I0919 05:20:13.570534 19842 process_memory.cc:181] Process hard memory limit is 
0.464404 GB
I0919 05:20:13.589357 19842 process_memory.cc:183] Process soft memory limit is 
0.371523 GB
I0919 05:20:13.589368 19842 process_memory.cc:186] Process memory pressure 
threshold is 0.278643 GB
Wrote minidump to 
/var/log/kudu/minidumps/kudu-tserver/1b23130b-f2d7-4724-7eb86aa5-72cfcc16.dmp
Wrote minidump to 
/var/log/kudu/minidumps/kudu-tserver/1b23130b-f2d7-4724-7eb86aa5-72cfcc16.dmp
*** Aborted at 1505823613 (unix time) try "date -d @1505823613" if you are 
using GNU date ***
PC: @     0x7ff5ea9b4067 gsignal
*** SIGABRT (@0x3e700004cee) received by PID 19694 (TID 0x7ff5adfe0700) from 
PID 19694; stack trace: ***
    @     0x7ff5ec6ce8d0 (unknown)
    @     0x7ff5ea9b4067 gsignal
    @     0x7ff5ea9b5448 abort
    @          0x2626d1b kudu::AbortFailureFunction()
    @          0x112f26d google::LogMessage::Fail()
W0919 05:20:13.640193 19841 heartbeater.cc:524] Failed to heartbeat to 
127.0.0.1:7051: Network error: Failed to ping master at 127.0.0.1:7051: Client 
connection negotiation failed: client connection to 127.0.0.1:7051: connect: 
Connection refused (error 111)
    @          0x113113d google::LogMessage::SendToLog()
    @          0x112eda9 google::LogMessage::Flush()
    @          0x1131bdf google::LogMessageFatal::~LogMessageFatal()
    @          0x11d88ba kudu::server::ServerBase::GetStatusPB()
    @          0x12034fc kudu::server::GenericServiceImpl::GetStatus()
    @          0x121f062 
_ZZN4kudu6server16GenericServiceIfC4ERK13scoped_refptrINS_12MetricEntityEERKS2_INS_3rpc13ResultTrackerEEENKUlPKN6google8protobuf7MessageEPSE_PNS7_10RpcContextEE10_clESG_SH_SJ_
    @          0x1221cf9 
_ZNSt17_Function_handlerIFvPKN6google8protobuf7MessageEPS2_PN4kudu3rpc10RpcContextEEZNS6_6server16GenericServiceIfC4ERK13scoped_refptrINS6_12MetricEntityEERKSD_INS7_13ResultTrackerEEEUlS4_S5_S9_E10_E9_M_invokeERKSt9_Any_dataS4_S5_S9_
    @          0x24a902a std::function<>::operator()()
    @          0x24a888b kudu::rpc::GeneratedServiceIf::Handle()
    @          0x24aa9b4 kudu::rpc::ServicePool::RunThread()
    @          0x24aba2d boost::_mfi::mf0<>::operator()()
    @          0x24ab85c boost::_bi::list1<>::operator()<>()
    @          0x24ab751 boost::_bi::bind_t<>::operator()()
    @          0x24ab687 
boost::detail::function::void_function_obj_invoker0<>::invoke()
    @          0x24cb570 boost::function0<>::operator()()
    @          0x265cb52 kudu::Thread::SuperviseThread()
    @     0x7ff5ec6c70a4 start_thread
    @     0x7ff5eaa6787d clone
    @                0x0 (unknown)
{noformat}


> 'kudu [master|tserver] status' may crash starting Kudu server
> -------------------------------------------------------------
>
>                 Key: KUDU-2148
>                 URL: https://issues.apache.org/jira/browse/KUDU-2148
>             Project: Kudu
>          Issue Type: Bug
>          Components: tserver
>    Affects Versions: 1.5.0
>            Reporter: Alexey Serbin
>            Priority: Critical
>              Labels: newbie
>
> When running the {{kudu}} CLI tool (particularly, {{kudu master status}} or 
> {{kudu tserver status}}) against a master or a tablet server which is 
> starting up, the starting process crashes when the RPC request arrives after 
> the RPC server starts but before the embedded web server starts.
> Below is an example of tserver's log captured for the scenario described 
> above:
> {noformat}
> I0919 05:20:12.843554 19694 tablet_server_main.cc:72] Tablet server 
> non-default flags:
> --raft_heartbeat_interval_ms=50
> --fs_data_dirs=/var/lib/kudu/tserver
> --fs_wal_dir=/var/lib/kudu/tserver
> --tserver_master_addrs=127.0.0.1:7051
> --heap_profile_path=/tmp/kudu-tserver.19694
> --unlock_experimental_flags=true
> --unlock_unsafe_flags=true
> --flagfile=/etc/kudu/conf/tserver.gflagfile
> --log_dir=/var/log/kudu
> --logtostderr=true
> Tablet server version:
> kudu 1.6.0-SNAPSHOT
> revision 052d1def8c39a4e50f4957a95ae19764deb30938
> build type DEBUG
> built by jenkins at 19 Sep 2017 03:59:34 PST on my-node
> build id 2017-09-19_03-59-27
> I0919 05:20:12.844458 19694 minidump.cc:237] Setting minidump size limit to 
> 20M
> I0919 05:20:12.916566 19694 tablet_server_main.cc:79] Initializing tablet 
> server...
> I0919 05:20:12.969810 19694 system_ntp.cc:104] NTP initialized. Resolution in 
> nanos?: 1 Skew PPM: 500 Current error: 47022
> I0919 05:20:12.969980 19694 server_base.cc:228] Could not load existing FS 
> layout: Not found: /var/lib/kudu/tserver/instance: No such file or directory 
> (error 2)
> I0919 05:20:12.969992 19694 server_base.cc:229] Creating new FS layout
> I0919 05:20:12.985256 19694 fs_manager.cc:474] Generated new instance 
> metadata in path /var/lib/kudu/tserver/instance:
> uuid: "89b8ccc42dfb4d39a83d801ceb1ff0a1"
> format_stamp: "Formatted at 2017-09-19 12:20:12 on my-node"
> I0919 05:20:12.993276 19694 fs_manager.cc:431] Time spent creating directory 
> manager: real 0.006s     user 0.000s     sys 0.000s
> I0919 05:20:12.993454 19694 env_posix.cc:1589] Not raising process file limit 
> of 65536; it is already as high as it can go
> I0919 05:20:12.993530 19694 file_cache.cc:469] Constructed file cache lbm 
> with capacity 26214
> I0919 05:20:12.995923 19694 fs_manager.cc:335] Time spent opening block 
> manager: real 0.000s  user 0.000s     sys 0.000s
> I0919 05:20:12.995976 19694 fs_manager.cc:348] Opened local filesystem: 
> /var/lib/kudu/tserver
> uuid: "89b8ccc42dfb4d39a83d801ceb1ff0a1"
> format_stamp: "Formatted at 2017-09-19 12:20:12 on my-node"
> I0919 05:20:12.996026 19694 fs_report.cc:347] Block manager report
> --------------------
> 1 data directories: /var/lib/kudu/tserver/data
> Total live blocks: 0
> Total live bytes: 0
> Total live bytes (after alignment): 0
> Total number of LBM containers: 0 (0 full)
> Did not check for missing blocks
> Did not check for orphaned blocks
> Total full LBM containers with extra space: 0 (0 repaired)
> Total full LBM container extra space in bytes: 0 (0 repaired)
> Total incomplete LBM containers: 0 (0 repaired)
> Total LBM partial records: 0 (0 repaired)
> I0919 05:20:13.107158 19694 ts_tablet_manager.cc:216] Loaded tablet metadata 
> (0 live tablets)
> I0919 05:20:13.107344 19694 tablet_server_main.cc:84] Starting tablet 
> server...
> I0919 05:20:13.149742 19694 rpc_server.cc:192] RPC server started. Bound to: 
> 0.0.0.0:7050
> I0919 05:20:13.569461 19694 webserver.cc:149] Starting webserver on 
> 0.0.0.0:8050
> I0919 05:20:13.569495 19694 webserver.cc:160] Document root disabled
> I0919 05:20:13.569829 19694 webserver.cc:282] Webserver started. Bound to: 
> http://0.0.0.0:8050/
> F0919 05:20:13.158834 19831 server_base.cc:316] Check failed: _s.ok() Bad 
> status: Illegal state: Not started
> *** Check failure stack trace: ***
> I0919 05:20:13.589279 19694 tablet_server_main.cc:87] Tablet server 
> successfully started.
> I0919 05:20:13.570534 19842 process_memory.cc:181] Process hard memory limit 
> is 0.464404 GB
> I0919 05:20:13.589357 19842 process_memory.cc:183] Process soft memory limit 
> is 0.371523 GB
> I0919 05:20:13.589368 19842 process_memory.cc:186] Process memory pressure 
> threshold is 0.278643 GB
> Wrote minidump to 
> /var/log/kudu/minidumps/kudu-tserver/1b23130b-f2d7-4724-7eb86aa5-72cfcc16.dmp
> Wrote minidump to 
> /var/log/kudu/minidumps/kudu-tserver/1b23130b-f2d7-4724-7eb86aa5-72cfcc16.dmp
> *** Aborted at 1505823613 (unix time) try "date -d @1505823613" if you are 
> using GNU date ***
> PC: @     0x7ff5ea9b4067 gsignal
> *** SIGABRT (@0x3e700004cee) received by PID 19694 (TID 0x7ff5adfe0700) from 
> PID 19694; stack trace: ***
>     @     0x7ff5ec6ce8d0 (unknown)
>     @     0x7ff5ea9b4067 gsignal
>     @     0x7ff5ea9b5448 abort
>     @          0x2626d1b kudu::AbortFailureFunction()
>     @          0x112f26d google::LogMessage::Fail()
> W0919 05:20:13.640193 19841 heartbeater.cc:524] Failed to heartbeat to 
> 127.0.0.1:7051: Network error: Failed to ping master at 127.0.0.1:7051: 
> Client connection negotiation failed: client connection to 127.0.0.1:7051: 
> connect: Connection refused (error 111)
>     @          0x113113d google::LogMessage::SendToLog()
>     @          0x112eda9 google::LogMessage::Flush()
>     @          0x1131bdf google::LogMessageFatal::~LogMessageFatal()
>     @          0x11d88ba kudu::server::ServerBase::GetStatusPB()
>     @          0x12034fc kudu::server::GenericServiceImpl::GetStatus()
>     @          0x121f062 
> _ZZN4kudu6server16GenericServiceIfC4ERK13scoped_refptrINS_12MetricEntityEERKS2_INS_3rpc13ResultTrackerEEENKUlPKN6google8protobuf7MessageEPSE_PNS7_10RpcContextEE10_clESG_SH_SJ_
>     @          0x1221cf9 
> _ZNSt17_Function_handlerIFvPKN6google8protobuf7MessageEPS2_PN4kudu3rpc10RpcContextEEZNS6_6server16GenericServiceIfC4ERK13scoped_refptrINS6_12MetricEntityEERKSD_INS7_13ResultTrackerEEEUlS4_S5_S9_E10_E9_M_invokeERKSt9_Any_dataS4_S5_S9_
>     @          0x24a902a std::function<>::operator()()
>     @          0x24a888b kudu::rpc::GeneratedServiceIf::Handle()
>     @          0x24aa9b4 kudu::rpc::ServicePool::RunThread()
>     @          0x24aba2d boost::_mfi::mf0<>::operator()()
>     @          0x24ab85c boost::_bi::list1<>::operator()<>()
>     @          0x24ab751 boost::_bi::bind_t<>::operator()()
>     @          0x24ab687 
> boost::detail::function::void_function_obj_invoker0<>::invoke()
>     @          0x24cb570 boost::function0<>::operator()()
>     @          0x265cb52 kudu::Thread::SuperviseThread()
>     @     0x7ff5ec6c70a4 start_thread
>     @     0x7ff5eaa6787d clone
>     @                0x0 (unknown)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to