Philip Zeyliger has posted comments on this change. ( http://gerrit.cloudera.org:8080/10696 )
Change subject: [DRAFT] IMPALA-6189: Add thread watchdogs for HDFS IO calls ...................................................................... Patch Set 3: (7 comments) Hi! Thanks for doing this. The big question for me is what our policy is here. Should we be doing this for all blocking system calls? All blocking RPCs? How do we maintain that we have decent coverage? For the Java stuff, an alternative approach is to call out via our regular Thrift-y/JNI-y route to ask Java to get the stack trace using management beans. I'm pretty sure you can match native thread ids to Java ones, though based on reading https://gist.github.com/rednaxelafx/843622/eb0b0877ff4aac77c76e5a50f7621dc32ea451eb it looks like it's hard. (In jstack, it's the nid=... but you need to do a hex to decimal conversion for pids. But it looks like it's not readily available out of Java.) I think it may also make sense to expose a counter on how often this happens. A monitoring tool would want to alert if this is happening a lot, and it won't want to grep the logs. Even more interesting would be to write down when it happens on behalf of a query in the profile, but that's not always possible. http://gerrit.cloudera.org:8080/#/c/10696/3/be/src/exec/hdfs-table-sink.cc File be/src/exec/hdfs-table-sink.cc: http://gerrit.cloudera.org:8080/#/c/10696/3/be/src/exec/hdfs-table-sink.cc@410 PS3, Line 410: hdfsFileInfo* info = hdfsGetPathInfo(output_partition->hdfs_connection, This is an RPC too. Should we be watching it? http://gerrit.cloudera.org:8080/#/c/10696/3/be/src/kudu/util/debug-util.cc File be/src/kudu/util/debug-util.cc: http://gerrit.cloudera.org:8080/#/c/10696/3/be/src/kudu/util/debug-util.cc@62 PS3, Line 62: DEFINE_int32(stack_trace_sig_num, SIGRTMIN, "Signal used by the thread watchdog to " Why is this configurable? http://gerrit.cloudera.org:8080/#/c/10696/3/be/src/kudu/util/debug-util.cc@212 PS3, Line 212: << "Kudu will not produce thread stack traces."; s/Kudu/Impala? http://gerrit.cloudera.org:8080/#/c/10696/3/be/src/kudu/util/debug-util.cc@288 PS3, Line 288: ret << g_comm.jstack.Stringify(); This is new code. I think it may make sense to feature flag it (--kernel_watchdog_jstack_enabled) so that we can turn it off if we find that it's wrong. http://gerrit.cloudera.org:8080/#/c/10696/3/be/src/runtime/io/scan-range.cc File be/src/runtime/io/scan-range.cc: http://gerrit.cloudera.org:8080/#/c/10696/3/be/src/runtime/io/scan-range.cc@642 PS3, Line 642: current_bytes_read = hdfsRead(fs_, hdfs_file, buffer + *bytes_read, Isn't this the main thing we're interested in? http://gerrit.cloudera.org:8080/#/c/10696/3/be/src/service/CMakeLists.txt File be/src/service/CMakeLists.txt: http://gerrit.cloudera.org:8080/#/c/10696/3/be/src/service/CMakeLists.txt@53 PS3, Line 53: # Jvmti agent library that registers with the JVM on load. As I mentioned elsewhere (perhaps later), I'm interested in whether this adds a lot to our binary. We're already launching the JVM from a statically linked binary; do we need to point it to another file to get methods? (There's a way around this for normal JNI libraries in JDK1.8+ though I'd have to look it up.) http://gerrit.cloudera.org:8080/#/c/10696/3/bin/impala-config.sh File bin/impala-config.sh: http://gerrit.cloudera.org:8080/#/c/10696/3/bin/impala-config.sh@619 PS3, Line 619: LIBHDFS_OPTS+=" -agentpath:$IMPALA_HOME/be/build/latest/service/libjvmtiagent.so" It surprises me that we have to do this with options, rather than being able to do it directly. What's going on is that we're doubling down here on our reliance on hadoop-hdfs/src/main/native/libhdfs/jni_helper.c. It's definitely not a public API that we should be depending on. Joe ran into https://issues.apache.org/jira/browse/HDFS-12628?jql=reporter%3Djoemcdonnell%20and%20project%3Dhdfs which is somewhat related. It's also problematic that we're introducing a new .so. We shouldn't have to since we statically link things. What's the size of the new .SO? Are we adding 100MB to everyone's builds? impala-config.sh is read in our build process. How do these settings work when we're running on clusters? -- To view, visit http://gerrit.cloudera.org:8080/10696 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I28e918761c120043d332b034450208eaf34e3e2b Gerrit-Change-Number: 10696 Gerrit-PatchSet: 3 Gerrit-Owner: Bharath Vissapragada <[email protected]> Gerrit-Reviewer: Balazs Jeszenszky <[email protected]> Gerrit-Reviewer: Bharath Vissapragada <[email protected]> Gerrit-Reviewer: Dan Hecht <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Michael Ho <[email protected]> Gerrit-Reviewer: Philip Zeyliger <[email protected]> Gerrit-Reviewer: Zoram Thanga <[email protected]> Gerrit-Comment-Date: Fri, 22 Jun 2018 15:23:48 +0000 Gerrit-HasComments: Yes
