Hello Kudu Jenkins, Andrew Wong, Adar Dembo, Bankim Bhavsar,
I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/14908
to look at the new patch set (#3).
Change subject: [master/tserver] non-zero code from main() instead of crashing
......................................................................
[master/tserver] non-zero code from main() instead of crashing
Prior to this patch, Kudu masters and tablet servers would crash if
{Master,TabletServer}::{Init,Start}() returned a non-OK status. That
behavior offers little advantage over returning a non-zero exit code
from main():
* Since those calls are made in the context of main(), there is
an easy way to properly handle a non-OK status from Init() and
Start() without compromising the consistency of the process's
behavior or its address space: just return non-zero from main().
* From the monitoring and reporting perspectives, it's possible to
detect a failure based on the exit status of a Kudu process.
* In most production deployments, core dumps are disabled, so only
minidumps were available from processes that crashed this way.
However, a minidump offers little information for troubleshooting
because the heap is stripped. The stack trace that comes with a
minidump is barely useful either, since it lacks even the
information that's available from the logs:
#0 0x00007f2445c691f7 in raise () from ./lib64/libc.so.6
#1 0x00007f2445c6a8e8 in abort () from ./lib64/libc.so.6
#2 0x0000000001bcf1e9 in kudu::AbortFailureFunction ()
at src/kudu/util/minidump.cc:190
#3 0x0000000000902fad in google::LogMessage::Fail ()
at thirdparty/src/glog-0.3.5/src/logging.cc:1488
#4 0x0000000000904f03 in google::LogMessage::SendToLog
(this=0x7ffc44ffb3c0)
at thirdparty/src/glog-0.3.5/src/logging.cc:1442
#5 0x0000000000902b09 in google::LogMessage::Flush
(this=this@entry=0x7ffc44ffb3c0)
at thirdparty/src/glog-0.3.5/src/logging.cc:1311
#6 0x000000000090588f in google::LogMessageFatal::~LogMessageFatal
(this=0x7ffc44ffb3c0, __in_chrg=<optimized out>)
at thirdparty/src/glog-0.3.5/src/logging.cc:2023
#7 0x000000000089c9c3 in kudu::master::MasterMain (argc=1,
argv=0x7ffc44ffbb60)
at src/kudu/master/master_main.cc:74
#8 0x00007f2445c55c05 in __libc_start_main () from ./lib64/libc.so.6
#9 0x000000000089c3c5 in _start ()
This patch changes the described behavior. I also updated the handling
of a non-OK status returned from CheckCPUFlags() during the earliest
stage of initialization, when a CPU without SSE4.2 or SSSE3 support is
detected.
With this patch, if Kudu masters and tablet servers fail to initialize
or start, they write an error message to the log and exit with a
non-zero status instead of crashing.
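For illustration only, the pattern described above might be sketched as
follows. This is not the code from the patch: Status, FakeServer, and
ServerMain are hypothetical stand-ins (the real code uses kudu::Status
and the master/tserver classes), and writing to stderr stands in for
writing to the log.

```cpp
#include <cstdlib>
#include <iostream>
#include <string>
#include <utility>

// Hypothetical, minimal stand-in for a status type; not kudu::Status.
class Status {
 public:
  static Status OK() { return Status(true, ""); }
  static Status RuntimeError(std::string msg) {
    return Status(false, std::move(msg));
  }
  bool ok() const { return ok_; }
  const std::string& message() const { return msg_; }

 private:
  Status(bool ok, std::string msg) : ok_(ok), msg_(std::move(msg)) {}
  bool ok_;
  std::string msg_;
};

// Hypothetical server whose Init()/Start() results are injected by the
// caller, so that both the success and the failure paths can be exercised.
class FakeServer {
 public:
  FakeServer(Status init_result, Status start_result)
      : init_result_(std::move(init_result)),
        start_result_(std::move(start_result)) {}
  Status Init() const { return init_result_; }
  Status Start() const { return start_result_; }

 private:
  Status init_result_;
  Status start_result_;
};

// Hypothetical body of a server's main(): on a non-OK status, report the
// error and return a non-zero code instead of aborting the process.
int ServerMain(const FakeServer& server) {
  Status s = server.Init();
  if (!s.ok()) {
    std::cerr << "Init() failed: " << s.message() << std::endl;
    return EXIT_FAILURE;
  }
  s = server.Start();
  if (!s.ok()) {
    std::cerr << "Start() failed: " << s.message() << std::endl;
    return EXIT_FAILURE;
  }
  return EXIT_SUCCESS;
}
```

In this sketch, main() would simply return the result of ServerMain(),
so a supervisor or a test can tell a failed start from a successful one
by looking at the exit status of the process.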
Change-Id: Id06646e2211eb24db28c582455d4a34af7501b26
---
M src/kudu/integration-tests/security-faults-itest.cc
M src/kudu/master/master_main.cc
M src/kudu/tserver/tablet_server_main.cc
M src/kudu/util/init.cc
M src/kudu/util/init.h
M src/kudu/util/logging.h
6 files changed, 37 insertions(+), 32 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/08/14908/3
--
To view, visit http://gerrit.cloudera.org:8080/14908
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id06646e2211eb24db28c582455d4a34af7501b26
Gerrit-Change-Number: 14908
Gerrit-PatchSet: 3
Gerrit-Owner: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Bankim Bhavsar <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)