Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10696 )

Change subject: [DRAFT] IMPALA-6189: Add thread watchdogs for HDFS IO calls
......................................................................


Patch Set 4:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/10696/4/be/src/kudu/util/debug-util.cc
File be/src/kudu/util/debug-util.cc:

http://gerrit.cloudera.org:8080/#/c/10696/4/be/src/kudu/util/debug-util.cc@62
PS4, Line 62: DEFINE_int32(stack_trace_sig_num, SIGRTMIN, "Signal used by the 
thread watchdog to "
this is already settable in the Kudu version of the code by just calling 
SetStackTraceSignal(). We have it defined by an API instead of a flag because 
this code ends up linked into our client, etc, and we need it to be 
programmatically settable (the client has no command line).

Is there some reason you couldn't use the existing API call, and put this flag 
in Impala's initialization sequence?


http://gerrit.cloudera.org:8080/#/c/10696/4/be/src/kudu/util/debug-util.cc@122
PS4, Line 122:   // TODO: Implement this as a callback rather than polluting 
Kudu's util code?
if you put it behind an ifdef of some kind that would be fine, and maybe easier 
than refactoring


http://gerrit.cloudera.org:8080/#/c/10696/4/be/src/kudu/util/debug-util.cc@272
PS4, Line 272:   std::stringstream ret;
why make this change? seems unnecessary. we usually avoid std::stringstream


http://gerrit.cloudera.org:8080/#/c/10696/4/be/src/kudu/util/kernel_stack_watchdog.cc
File be/src/kudu/util/kernel_stack_watchdog.cc:

http://gerrit.cloudera.org:8080/#/c/10696/4/be/src/kudu/util/kernel_stack_watchdog.cc@135
PS4, Line 135:         // TODO: Implement an exponential backoff?
why? if we can't get it immediately, then backing off and trying later probably 
won't be successful either, and the actual timing becomes inaccurate


http://gerrit.cloudera.org:8080/#/c/10696/4/be/src/kudu/util/kernel_stack_watchdog.cc@151
PS4, Line 151:               << "User stack:\n" << user_stack << "\n";
glog already puts a \n at the end of log messages, this shouldn't be necessary


http://gerrit.cloudera.org:8080/#/c/10696/4/be/src/runtime/io/handle-cache.inline.h
File be/src/runtime/io/handle-cache.inline.h:

http://gerrit.cloudera.org:8080/#/c/10696/4/be/src/runtime/io/handle-cache.inline.h@34
PS4, Line 34:     SCOPED_WATCHDOG();
you don't want some string in here indicating which file it's trying to open or 
somesuch?


http://gerrit.cloudera.org:8080/#/c/10696/4/be/src/service/jvmti-agent.cc
File be/src/service/jvmti-agent.cc:

http://gerrit.cloudera.org:8080/#/c/10696/4/be/src/service/jvmti-agent.cc@27
PS4, Line 27: // by adding "-agentpath:<absolute path to libjvmtiagent.so> to 
the JVM commandline
does this then prevent loading of any other java agents in impala? Do you 
actually need this separate shared library to do this even in the case where 
you are embedding the JVM?


http://gerrit.cloudera.org:8080/#/c/10696/4/be/src/util/jvm-stack-trace.h
File be/src/util/jvm-stack-trace.h:

http://gerrit.cloudera.org:8080/#/c/10696/4/be/src/util/jvm-stack-trace.h@46
PS4, Line 46:     void Collect();
this should document whether it's safe to be called from async context (signal 
handler)


http://gerrit.cloudera.org:8080/#/c/10696/4/be/src/util/jvm-stack-trace.cc
File be/src/util/jvm-stack-trace.cc:

http://gerrit.cloudera.org:8080/#/c/10696/4/be/src/util/jvm-stack-trace.cc@65
PS4, Line 65:   jvmtiEnv* jvmti = JVMTIState::GetInstance()->GetJVMTI();
How do you know that all of the functions used in this method are async-safe? 
Are they documented as such somewhere?

The profilers I've seen use VM::_asyncGetCallTrace() for this purpose, not 
jvmti methods. Comments in the async-profiler project seem to indicate these 
calls may not be safe in async context:

https://github.com/jvm-profiling-tools/async-profiler/blob/master/src/profiler.cpp#L249

https://github.com/jvm-profiling-tools/async-profiler/blob/master/src/profiler.cpp#L344


I think all the stack collection should be stress-tested by starting a thread 
which does stack traces of all other threads in a near-tight loop while running 
various Impala workloads. We found a lot of complex bugs causing hangs and 
crashes with this case. See debug-util-test.cc "DangerousOperationThread".



--
To view, visit http://gerrit.cloudera.org:8080/10696
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I28e918761c120043d332b034450208eaf34e3e2b
Gerrit-Change-Number: 10696
Gerrit-PatchSet: 4
Gerrit-Owner: Bharath Vissapragada <[email protected]>
Gerrit-Reviewer: Balazs Jeszenszky <[email protected]>
Gerrit-Reviewer: Bharath Vissapragada <[email protected]>
Gerrit-Reviewer: Dan Hecht <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Michael Ho <[email protected]>
Gerrit-Reviewer: Philip Zeyliger <[email protected]>
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-Reviewer: Zoram Thanga <[email protected]>
Gerrit-Comment-Date: Thu, 12 Jul 2018 19:43:33 +0000
Gerrit-HasComments: Yes

Reply via email to