Adar Dembo has posted comments on this change. Change subject: Control mutex stack walking in DEBUG mode with a gflag ......................................................................
Patch Set 2: (2 comments) http://gerrit.cloudera.org:8080/#/c/5741/2//COMMIT_MSG Commit Message: PS2, Line 9: In the case of a CHECK() failure, both the crashing thread and the : AsyncLogger thread may simultaneously attempt to symbolize their stacks. : AsyncLogger does this simply by acquiring a Mutex in RunThread(), which : triggers stack collection in DEBUG mode, while the crashing thread is : collecting the stack trace in order to print an important error message. Useful information, but the way it's written it sounds like AsyncLogger::RunThread() is intentionally trying to acquire a symbolized stack trace, when in fact it's coincidental: the thread needs to protect its critical state and acquires lock_ to do so. Could you reword this paragraph to reflect that? PS2, Line 25: This patch disables the Mutex owner stack trace collection on DEBUG : builds by default, only enabling it when a hidden gflag is set. Is there no better fix than disabling stack collection in mutexes by default? That seems really harsh. I'm trying to understand exactly how the race manifests. The AsyncLogger doesn't wrap FATAL messages, but FATAL messages are also logged to other severities first. The AsyncLogger will receive a Write() call for each non-FATAL severities, and, due to the code in Write(), will also Flush() at each severity. Only after all those Flush() calls (which are synchronous) will the glog FATAL handling take effect and crash the process. So, where in here does AsyncLogger::RunThread() run again and acquire a lock? There should be nothing left to log by the time the glog FATAL handling takes effect. More generally though, it is problematic for any kudu::Mutex acquisition that happens concurrently with a CHECK() messes up the stack trace of the CHECK(). Perhaps we could loop the stack trace acquisition until we get a real stack? Or perhaps we can patch stacktrace_libunwind-inl.h so that g_now_entering is thread-local? Using a global and not TLS means that different threads are prevented from concurrently getting their stack traces, and the large block comment there suggests this was not their intent. -- To view, visit http://gerrit.cloudera.org:8080/5741 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ie4593cf7173867ce2f6151e03df0be94f97d95d2 Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Mike Percy <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: Mike Percy <[email protected]> Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-HasComments: Yes
