Alexey Serbin has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/14908 )

Change subject: [master/tserver] non-zero code from main() instead of crashing
......................................................................

[master/tserver] non-zero code from main() instead of crashing

Prior to this patch, Kudu masters and tablet servers would crash if
{Master,TabletServer}::{Init,Start}() returned a non-OK status.  There is
little advantage in that behavior compared with returning a non-zero
exit code from main():

  * Since those calls are made in the context of main(), there is an
    easy way to handle a non-OK status from Init() and Start() without
    sacrificing the consistency of the process's behavior and its
    address space: just return a non-zero value from main().

  * From the monitoring and reporting perspective, a failure can be
    detected from the exit status of a Kudu process.

  * In most production deployments core dumps are disabled, so only
    minidumps are available from processes that crash this way.
    However, a minidump contains little information useful for
    troubleshooting because the heap is stripped from it.  The stack
    trace in a minidump is barely useful either: it does not even
    include the information that is already available in the logs:

    #0  0x00007f2445c691f7 in raise () from ./lib64/libc.so.6
    #1  0x00007f2445c6a8e8 in abort () from ./lib64/libc.so.6
    #2  0x0000000001bcf1e9 in kudu::AbortFailureFunction ()
            at src/kudu/util/minidump.cc:190
    #3  0x0000000000902fad in google::LogMessage::Fail ()
            at thirdparty/src/glog-0.3.5/src/logging.cc:1488
    #4  0x0000000000904f03 in google::LogMessage::SendToLog (this=0x7ffc44ffb3c0)
            at thirdparty/src/glog-0.3.5/src/logging.cc:1442
    #5  0x0000000000902b09 in google::LogMessage::Flush (this=this@entry=0x7ffc44ffb3c0)
            at thirdparty/src/glog-0.3.5/src/logging.cc:1311
    #6  0x000000000090588f in google::LogMessageFatal::~LogMessageFatal (this=0x7ffc44ffb3c0, __in_chrg=<optimized out>)
            at thirdparty/src/glog-0.3.5/src/logging.cc:2023
    #7  0x000000000089c9c3 in kudu::master::MasterMain (argc=1, argv=0x7ffc44ffbb60)
            at src/kudu/master/master_main.cc:74
    #8  0x00007f2445c55c05 in __libc_start_main () from ./lib64/libc.so.6
    #9  0x000000000089c3c5 in _start ()

This patch changes the described behavior.  I also updated the handling
of a non-OK status returned by CheckCPUFlags() during early
initialization when a CPU without SSE4.2 or SSSE3 support is detected.

With this patch, if a Kudu master or tablet server fails to initialize or
start, it writes an error message to the log and exits with a non-zero
status instead of crashing.
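
A minimal sketch of the behavior described above, assuming a simplified
stand-in for kudu::Status and a hypothetical FakeServer class whose
Init()/Start() return a status; the real code uses kudu::Status, glog and
the actual Master/TabletServer classes, none of which are reproduced
here.  Instead of CHECK_OK-style aborting, the entry point logs the error
and returns a non-zero code that main() passes through as the exit
status.

```cpp
#include <iostream>
#include <string>
#include <utility>

// Hypothetical, simplified stand-in for kudu::Status (not the real class).
class Status {
 public:
  static Status OK() { return Status(true, "OK"); }
  static Status RuntimeError(const std::string& msg) {
    return Status(false, "Runtime error: " + msg);
  }
  bool ok() const { return ok_; }
  const std::string& ToString() const { return msg_; }

 private:
  Status(bool ok, std::string msg) : ok_(ok), msg_(std::move(msg)) {}
  bool ok_;
  std::string msg_;
};

// Hypothetical server standing in for Master or TabletServer; the flags
// simulate Init()/Start() failures for illustration only.
class FakeServer {
 public:
  FakeServer(bool fail_init, bool fail_start)
      : fail_init_(fail_init), fail_start_(fail_start) {}
  Status Init() {
    return fail_init_ ? Status::RuntimeError("simulated Init() failure")
                      : Status::OK();
  }
  Status Start() {
    return fail_start_ ? Status::RuntimeError("simulated Start() failure")
                       : Status::OK();
  }

 private:
  bool fail_init_;
  bool fail_start_;
};

// Hypothetical *Main() entry point: rather than aborting on a non-OK status,
// write the error to the log (std::cerr stands in for glog here) and return
// a non-zero code to main().
int ServerMain(FakeServer& server) {
  Status s = server.Init();
  if (!s.ok()) {
    std::cerr << "Failed to initialize server: " << s.ToString() << std::endl;
    return 1;
  }
  s = server.Start();
  if (!s.ok()) {
    std::cerr << "Failed to start server: " << s.ToString() << std::endl;
    return 1;
  }
  // A real server would now serve requests until it is shut down.
  return 0;
}

// In a real binary, main() would simply forward the result, e.g.:
//   int main() { FakeServer server(false, false); return ServerMain(server); }
```

A supervisor or monitoring tool can then treat any non-zero exit status
as a failed start, without needing a core dump or minidump.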

Change-Id: Id06646e2211eb24db28c582455d4a34af7501b26
Reviewed-on: http://gerrit.cloudera.org:8080/14908
Reviewed-by: Andrew Wong <[email protected]>
Reviewed-by: Adar Dembo <[email protected]>
Tested-by: Kudu Jenkins
---
M src/kudu/integration-tests/security-faults-itest.cc
M src/kudu/master/master_main.cc
M src/kudu/tserver/tablet_server_main.cc
M src/kudu/util/init.cc
M src/kudu/util/init.h
M src/kudu/util/status.h
6 files changed, 39 insertions(+), 31 deletions(-)

Approvals:
  Andrew Wong: Looks good to me, approved
  Adar Dembo: Looks good to me, approved
  Kudu Jenkins: Verified

--
To view, visit http://gerrit.cloudera.org:8080/14908

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Id06646e2211eb24db28c582455d4a34af7501b26
Gerrit-Change-Number: 14908
Gerrit-PatchSet: 6
Gerrit-Owner: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Bankim Bhavsar <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
