Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/10696 )
Change subject: [DRAFT] IMPALA-6189: Add thread watchdogs for HDFS IO calls ...................................................................... Patch Set 4: (9 comments) http://gerrit.cloudera.org:8080/#/c/10696/4/be/src/kudu/util/debug-util.cc File be/src/kudu/util/debug-util.cc: http://gerrit.cloudera.org:8080/#/c/10696/4/be/src/kudu/util/debug-util.cc@62 PS4, Line 62: DEFINE_int32(stack_trace_sig_num, SIGRTMIN, "Signal used by the thread watchdog to " this is already settable in the Kudu version of the code by just calling SetStackTraceSignal(). We have it defined by an API instead of a flag because this code ends up linked into our client, etc, and we need it to be programmatically settable (the client has no command line). Is there some reason you couldn't use the existing API call, and put this flag in Impala's initialization sequence? http://gerrit.cloudera.org:8080/#/c/10696/4/be/src/kudu/util/debug-util.cc@122 PS4, Line 122: // TODO: Implement this as a callback rather than polluting Kudu's util code? if you put it behind an ifdef of some kind that would be fine, and maybe easier than refactoring http://gerrit.cloudera.org:8080/#/c/10696/4/be/src/kudu/util/debug-util.cc@272 PS4, Line 272: std::stringstream ret; why make this change? seems unnecessary. we usually avoid std::stringstream http://gerrit.cloudera.org:8080/#/c/10696/4/be/src/kudu/util/kernel_stack_watchdog.cc File be/src/kudu/util/kernel_stack_watchdog.cc: http://gerrit.cloudera.org:8080/#/c/10696/4/be/src/kudu/util/kernel_stack_watchdog.cc@135 PS4, Line 135: // TODO: Implement an exponential backoff? why? if we can't get it immediately, then backing off and trying later probably won't be successful either, and the actual timing becomes inaccurate http://gerrit.cloudera.org:8080/#/c/10696/4/be/src/kudu/util/kernel_stack_watchdog.cc@151 PS4, Line 151: << "User stack:\n" << user_stack << "\n"; glog already puts a \n at the end of log messages, this shouldn't be necessary http://gerrit.cloudera.org:8080/#/c/10696/4/be/src/runtime/io/handle-cache.inline.h File be/src/runtime/io/handle-cache.inline.h: http://gerrit.cloudera.org:8080/#/c/10696/4/be/src/runtime/io/handle-cache.inline.h@34 PS4, Line 34: SCOPED_WATCHDOG(); you don't want some string in here indicating which file it's trying to open or somesuch? http://gerrit.cloudera.org:8080/#/c/10696/4/be/src/service/jvmti-agent.cc File be/src/service/jvmti-agent.cc: http://gerrit.cloudera.org:8080/#/c/10696/4/be/src/service/jvmti-agent.cc@27 PS4, Line 27: // by adding "-agentpath:<absolute path to libjvmtiagent.so> to the JVM commandline does this then prevent loading of any other java agents in impala? Do you actually need this separate shared library to do this even in the case where you are embedding the JVM? http://gerrit.cloudera.org:8080/#/c/10696/4/be/src/util/jvm-stack-trace.h File be/src/util/jvm-stack-trace.h: http://gerrit.cloudera.org:8080/#/c/10696/4/be/src/util/jvm-stack-trace.h@46 PS4, Line 46: void Collect(); this should document whether it's safe to be called from async context (signal handler) http://gerrit.cloudera.org:8080/#/c/10696/4/be/src/util/jvm-stack-trace.cc File be/src/util/jvm-stack-trace.cc: http://gerrit.cloudera.org:8080/#/c/10696/4/be/src/util/jvm-stack-trace.cc@65 PS4, Line 65: jvmtiEnv* jvmti = JVMTIState::GetInstance()->GetJVMTI(); How do you know that all of the functions used in this method are async-safe? Are they documented as such somewhere? The profilers I've seen use VM::_asyncGetCallTrace() for this purpose, not jvmti methods. Comments in the async-profiler project seem to indicate these calls may not be safe in async context: https://github.com/jvm-profiling-tools/async-profiler/blob/master/src/profiler.cpp#L249 https://github.com/jvm-profiling-tools/async-profiler/blob/master/src/profiler.cpp#L344 I think all the stack collection should be stress-tested by starting a thread which does stack traces of all other threads in a near-tight loop while running various Impala workloads. We found a lot of complex bugs causing hangs and crashes with this case. See debug-util-test.cc "DangerousOperationThread". -- To view, visit http://gerrit.cloudera.org:8080/10696 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I28e918761c120043d332b034450208eaf34e3e2b Gerrit-Change-Number: 10696 Gerrit-PatchSet: 4 Gerrit-Owner: Bharath Vissapragada <[email protected]> Gerrit-Reviewer: Balazs Jeszenszky <[email protected]> Gerrit-Reviewer: Bharath Vissapragada <[email protected]> Gerrit-Reviewer: Dan Hecht <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Michael Ho <[email protected]> Gerrit-Reviewer: Philip Zeyliger <[email protected]> Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-Reviewer: Zoram Thanga <[email protected]> Gerrit-Comment-Date: Thu, 12 Jul 2018 19:43:33 +0000 Gerrit-HasComments: Yes
