Adar Dembo has posted comments on this change. Change subject: Add Google Breakpad support to Kudu ......................................................................
Patch Set 21: (13 comments) http://gerrit.cloudera.org:8080/#/c/5620/20/build-support/jenkins/build-and-test.sh File build-support/jenkins/build-and-test.sh: PS20, Line 500: ls -al $TEST_TMPDIR : find $TEST_TMPDIR Maybe just "find $TEST_TMPDIR -ls" would yield the same thing as both of these? http://gerrit.cloudera.org:8080/#/c/5620/21/java/kudu-client/src/test/java/org/apache/kudu/client/MiniKuduCluster.java File java/kudu-client/src/test/java/org/apache/kudu/client/MiniKuduCluster.java: Line 51: import static org.junit.Assert.assertTrue; Unused? http://gerrit.cloudera.org:8080/#/c/5620/21/src/kudu/master/master_main.cc File src/kudu/master/master_main.cc: Line 62: CHECK_OK(BlockSigUSR1()); // Must be called before logging threads started. Can we do this within InitGoogleLoggingSafe()? It'd affect a few more processes that way (mostly experiments and tpch stuff), but I think that's OK, and it'd be less fragile. http://gerrit.cloudera.org:8080/#/c/5620/21/src/kudu/util/minidump-test.cc File src/kudu/util/minidump-test.cc: PS21, Line 85: #if defined(ADDRESS_SANITIZER) : // CHECK raises a SIGSEGV under ASAN and fails this test. : return; : #endif This one is also surprising. CHECK() should yield a SIGABRT under the hood, and ASAN converts it to a SIGSEGV? Any interesting ASAN logs to go with it? Line 113: // ASAN appears to catch SIGBUS, SIGSEGV, and SIGFPE and the process is not killed. Is there any documentation to corroborate this (and the TSAN case)? It's somewhat surprising; I would expect the sanitizers to at least chain their handlers to ours, otherwise they're breaking signal handling semantics for applications that use them. Line 117: #elif defined(THREAD_SANITIZER) I think it's technically possible to build with both ASAN and TSAN, so just for safety maybe convert these into two separate #ifdefs? While you're at it, you may want to reorder this so that the LOG(INFO) isn't emitted erroneously (i.e. for signals that aren't actually being tested). http://gerrit.cloudera.org:8080/#/c/5620/6/src/kudu/util/minidump.cc File src/kudu/util/minidump.cc: PS6, Line 28: #if defined(__linux__) : #include <breakpad/client/linux/handler/exception_handler.h> : #include <breakpad/common/linux/linux_libc_support.h> : #endif // defined(__linux__) > It sounds good in theory but then you have to worry about gflag DEFINEs (do Agreed that #ifdefs just here and conditional compilation of the minidump tests is net better than #ifdefs everywhere. http://gerrit.cloudera.org:8080/#/c/5620/21/src/kudu/util/minidump.cc File src/kudu/util/minidump.cc: Line 68: return false; // NOLINT(*) What does lint dislike about this? PS21, Line 178: non-user Nit: non-user signal (or non-user-signal) Line 179: google::InstallFailureFunction(nullptr); Maybe we should move this down to just before the breakpad exception handler is actually installed? We wouldn't want this behavior on failure, right? PS21, Line 261: // Send SIGUSR1 signal to thread, which will wake it up. : kill(getpid(), SIGUSR1); I found this confusing, since technically we're sending the signal to the entire process. AFAICT this means it is distributed to any thread in the process at random. But the way this works (I think) is: - We block SIGUSR1 in the process early, before any threads have started. - Any new thread inherits the signal mask of its parent, so all Kudu threads have SIGUSR1 blocked. - A thread calling sigwait() unblocks SIGUSR1 only to itself, guaranteeing that a SIGUSR1 will be picked up by sigwait(). This is based on my reading of http://elias.rhi.hi.is/libc/Threads-and-Signal-Handling.html. Out of curiosity, did you look into using signalfd() instead? The advantage of signalfd() is that now you have yet another fd to wait on for delivering SIGUSR1, and you can use primitives like select(), poll(), or epoll() to wait on it in the same way you'd wait on other stuff, or wait on it in libev. This requires more boilerplate code, but should allow you to use a real condition variable to notify the thread that it should die vs. sending a SIGUSR1. I'm nervous enough about UNIX signal handling that I can't recommend one over the other, but just I thought I'd ask. https://ldpreload.com/blog/signalfd-is-useless?reposted-on-request is also a good read, though it's about signalfd(), not sigwait(). Incidentally, a blocked SIGUSR1 will be inherited by all Kudu subprocesses. I don't think that's an issue today because we don't launch subprocesses in production, but we will soon (Todd is introducing something like that for kerberos). Blocking SIGUSR1 for them may be innocuous if they're short-lived, but if they're long-lived it could be problematic. May be a big fat warning somewhere in subprocess.h is in order? Line 270: CHECK_EQ(0, sigwait(&signals, &signal)); maybe PCHECK() would be more appropriate here, since EINVAL via errno is technically possible? http://gerrit.cloudera.org:8080/#/c/5620/6/src/kudu/util/minidump.h File src/kudu/util/minidump.h: Line 41: // handling facilities, this class must be invoked after installing those > I attempted using a GoogleOnce-based approach before rewriting this as a si I don't follow issue #1. Every KuduTest includes a directory hierarchy that's created at test start and deleted recursively at test stop. It'd be natural for the minidump files to be placed somewhere in that hierarchy and then you'd get automatic cleanup. Why doesn't this work? I also don't understand #2. Why is it important for different test invocations to use the same minidump directory? Why can't KuduTest provide a method that'll give an individual test direct access to that directory, wherever it may be? Personally I don't like introducing new singletons because it gets really messy when you need to do something like plumb non-singleton state into the singleton. An example that plagued me in the past: plumbing a memtracker (non-singleton) into the block cache (singleton). That gets messy in a MiniCluster world. I also think that all our singletons should adhere to the same (narrow) interface: a GetInstance() method that returns a bare pointer to the singleton object, constructing it when calling for the first time. This means there's no lifecycle; the object is created implicitly when it's first needed, and lives forever. It can be easily implemented with a GoogleOnce or what-have-you, and is very easy to reason about. I think a "singleton with a lifecycle" is the worst of both worlds, IMHO. You've got all the plumbing problems that a singleton gives you, along with all of the complexity of an object lifecycle. -- To view, visit http://gerrit.cloudera.org:8080/5620 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I495695cc38b75377f20b0497093a81ed5baa887f Gerrit-PatchSet: 21 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Mike Percy <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Dinesh Bhat <[email protected]> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Mike Percy <[email protected]> Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-HasComments: Yes
