Adar Dembo has posted comments on this change.

Change subject: Control mutex stack walking in DEBUG mode with a gflag
......................................................................


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/5741/2//COMMIT_MSG
Commit Message:

PS2, Line 9: In the case of a CHECK() failure, both the crashing thread and the
           : AsyncLogger thread may simultaneously attempt to symbolize their 
stacks.
           : AsyncLogger does this simply by acquiring a Mutex in RunThread(), 
which
           : triggers stack collection in DEBUG mode, while the crashing thread 
is
           : collecting the stack trace in order to print an important error 
message.
Useful information, but the way it's written it sounds like  
AsyncLogger::RunThread() is intentionally trying to acquire a symbolized stack 
trace, when in fact it's coincidental: the thread needs to protect its critical 
state and acquires lock_ to do so. Could you reword this paragraph to reflect 
that?


PS2, Line 25: This patch disables the Mutex owner stack trace collection on 
DEBUG
            : builds by default, only enabling it when a hidden gflag is set.
Is there no better fix than disabling stack collection in mutexes by default? 
That seems really harsh.

I'm trying to understand exactly how the race manifests. The AsyncLogger 
doesn't wrap FATAL messages, but FATAL messages are also logged to other 
severities first. The AsyncLogger will receive a Write() call for each 
non-FATAL severities, and, due to the code in Write(), will also Flush() at 
each severity. Only after all those Flush() calls (which are synchronous) will 
the glog FATAL handling take effect and crash the process. So, where in here 
does AsyncLogger::RunThread() run again and acquire a lock? There should be 
nothing left to log by the time the glog FATAL handling takes effect.

More generally though, it is problematic for any kudu::Mutex acquisition that 
happens concurrently with a CHECK() messes up the stack trace of the CHECK(). 
Perhaps we could loop the stack trace acquisition until we get a real stack? Or 
perhaps we can patch stacktrace_libunwind-inl.h so that g_now_entering is 
thread-local? Using a global and not TLS means that different threads are 
prevented from concurrently getting their stack traces, and the large block 
comment there suggests this was not their intent.


-- 
To view, visit http://gerrit.cloudera.org:8080/5741
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie4593cf7173867ce2f6151e03df0be94f97d95d2
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes

Reply via email to