Adar Dembo has posted comments on this change.

Change subject: Add Google Breakpad support to Kudu
......................................................................


Patch Set 21:

(13 comments)

http://gerrit.cloudera.org:8080/#/c/5620/20/build-support/jenkins/build-and-test.sh
File build-support/jenkins/build-and-test.sh:

PS20, Line 500:     ls -al $TEST_TMPDIR
              :     find $TEST_TMPDIR
Maybe just "find $TEST_TMPDIR -ls" would yield the same thing as both of these?


http://gerrit.cloudera.org:8080/#/c/5620/21/java/kudu-client/src/test/java/org/apache/kudu/client/MiniKuduCluster.java
File java/kudu-client/src/test/java/org/apache/kudu/client/MiniKuduCluster.java:

Line 51: import static org.junit.Assert.assertTrue;
Unused?


http://gerrit.cloudera.org:8080/#/c/5620/21/src/kudu/master/master_main.cc
File src/kudu/master/master_main.cc:

Line 62:   CHECK_OK(BlockSigUSR1()); // Must be called before logging threads 
started.
Can we do this within InitGoogleLoggingSafe()? It'd affect a few more processes 
that way (mostly experiments and tpch stuff), but I think that's OK, and it'd 
be less fragile.


http://gerrit.cloudera.org:8080/#/c/5620/21/src/kudu/util/minidump-test.cc
File src/kudu/util/minidump-test.cc:

PS21, Line 85: #if defined(ADDRESS_SANITIZER)
             :   // CHECK raises a SIGSEGV under ASAN and fails this test.
             :   return;
             : #endif
This one is also surprising. CHECK() should yield a SIGABRT under the hood, and 
ASAN converts it to a SIGSEGV? Any interesting ASAN logs to go with it?


Line 113:   // ASAN appears to catch SIGBUS, SIGSEGV, and SIGFPE and the 
process is not killed.
Is there any documentation to corroborate this (and the TSAN case)? It's 
somewhat surprising; I would expect the sanitizers to at least chain their 
handlers to ours, otherwise they're breaking signal handling semantics for 
applications that use them.


Line 117: #elif defined(THREAD_SANITIZER)
I think it's technically possible to build with both ASAN and TSAN, so just for 
safety maybe convert these into two separate #ifdefs?

While you're at it, you may want to reorder this so that the LOG(INFO) isn't 
emitted erroneously (i.e. for signals that aren't actually being tested).


http://gerrit.cloudera.org:8080/#/c/5620/6/src/kudu/util/minidump.cc
File src/kudu/util/minidump.cc:

PS6, Line 28: #if defined(__linux__)
            : #include <breakpad/client/linux/handler/exception_handler.h>
            : #include <breakpad/common/linux/linux_libc_support.h>
            : #endif // defined(__linux__)
> It sounds good in theory but then you have to worry about gflag DEFINEs (do
Agreed that #ifdefs just here and conditional compilation of the minidump tests 
is net better than #ifdefs everywhere.


http://gerrit.cloudera.org:8080/#/c/5620/21/src/kudu/util/minidump.cc
File src/kudu/util/minidump.cc:

Line 68:     return false; // NOLINT(*)
What does lint dislike about this?


PS21, Line 178: non-user 
Nit: non-user signal (or non-user-signal)


Line 179:   google::InstallFailureFunction(nullptr);
Maybe we should move this down to just before the breakpad exception handler is 
actually installed? We wouldn't want this behavior on failure, right?


PS21, Line 261:   // Send SIGUSR1 signal to thread, which will wake it up.
              :   kill(getpid(), SIGUSR1);
I found this confusing, since technically we're sending the signal to the 
entire process. AFAICT this means it is distributed to any thread in the 
process at random. But the way this works (I think) is:
- We block SIGUSR1 in the process early, before any threads have started.
- Any new thread inherits the signal mask of its parent, so all Kudu threads 
have SIGUSR1 blocked.
- A thread calling sigwait() unblocks SIGUSR1 only to itself, guaranteeing that 
a SIGUSR1 will be picked up by sigwait().

This is based on my reading of 
http://elias.rhi.hi.is/libc/Threads-and-Signal-Handling.html.

Out of curiosity, did you look into using signalfd() instead? The advantage of 
signalfd() is that now you have yet another fd to wait on for delivering 
SIGUSR1, and you can use primitives like select(), poll(), or epoll() to wait 
on it in the same way you'd wait on other stuff, or wait on it in libev. This 
requires more boilerplate code, but should allow you to use a real condition 
variable to notify the thread that it should die vs. sending a SIGUSR1.

I'm nervous enough about UNIX signal handling that I can't recommend one over 
the other, but just I thought I'd ask. 
https://ldpreload.com/blog/signalfd-is-useless?reposted-on-request is also a 
good read, though it's about signalfd(), not sigwait().

Incidentally, a blocked SIGUSR1 will be inherited by all Kudu subprocesses. I 
don't think that's an issue today because we don't launch subprocesses in 
production, but we will soon (Todd is introducing something like that for 
kerberos). Blocking SIGUSR1 for them may be innocuous if they're short-lived, 
but if they're long-lived it could be problematic. May be a big fat warning 
somewhere in subprocess.h is in order?


Line 270:     CHECK_EQ(0, sigwait(&signals, &signal));
maybe PCHECK() would be more appropriate here, since EINVAL via errno is 
technically possible?


http://gerrit.cloudera.org:8080/#/c/5620/6/src/kudu/util/minidump.h
File src/kudu/util/minidump.h:

Line 41: // handling facilities, this class must be invoked after installing 
those
> I attempted using a GoogleOnce-based approach before rewriting this as a si
I don't follow issue #1. Every KuduTest includes a directory hierarchy that's 
created at test start and deleted recursively at test stop. It'd be natural for 
the minidump files to be placed somewhere in that hierarchy and then you'd get 
automatic cleanup. Why doesn't this work?

I also don't understand #2. Why is it important for different test invocations 
to use the same minidump directory? Why can't KuduTest provide a method that'll 
give an individual test direct access to that directory, wherever it may be?

Personally I don't like introducing new singletons because it gets really messy 
when you need to do something like plumb non-singleton state into the 
singleton. An example that plagued me in the past: plumbing a memtracker 
(non-singleton) into the block cache (singleton). That gets messy in a 
MiniCluster world.

I also think that all our singletons should adhere to the same (narrow) 
interface: a GetInstance() method that returns a bare pointer to the singleton 
object, constructing it when calling for the first time. This means there's no 
lifecycle; the object is created implicitly when it's first needed, and lives 
forever. It can be easily implemented with a GoogleOnce or what-have-you, and 
is very easy to reason about.

I think a "singleton with a lifecycle" is the worst of both worlds, IMHO. 
You've got all the plumbing problems that a singleton gives you, along with all 
of the complexity of an object lifecycle.


-- 
To view, visit http://gerrit.cloudera.org:8080/5620
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I495695cc38b75377f20b0497093a81ed5baa887f
Gerrit-PatchSet: 21
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Dinesh Bhat <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes

Reply via email to