Agreed, let's abort out of an abundance of caution. That is the _least_ that should be done when a thread dies unexpectedly. Maybe we can improve resiliency for specific cases later.

On Jan 26, 2017, at 5:53 PM, Enis Söztutar <enis....@gmail.com> wrote:

>> Do we have worker threads that we can't safely continue without
>> indefinitely? Can we solve the general problem of "unhandled exception
>> in threads cause a RS Abort"?
>
> We have this already in the code base. We are injecting an
> UncaughtExceptionHandler (which is calling Abortable.abort) to almost all
> of the HRegionServer service threads (see HRS.startServiceThreads). But
> I've also seen cases where some important thread got unmarked. I think it
> is a good idea to revisit that and make sure that all the threads are
> injected with the UEH.
>
> The replication source threads are started on demand, that is why the UEH
> is not injected I think. But agreed that we should do the safe route here,
> and abort the regionserver.
>
> Enis
>
>> On Thu, Jan 26, 2017 at 2:19 PM, Josh Elser <els...@apache.org> wrote:
>>
>> +1 If any "worker" thread can't safely/reasonably retry some unexpected
>> exception without a reasonable expectation of self-healing, tank the RS.
>>
>> Having those threads die but not the RS could go uncaught for an
>> indefinite period of time.
>>
>> Sean Busbey wrote:
>>
>>> I've noticed a few other places where we can lose a worker thread and
>>> the RegionServer happily continues. One notable example is the worker
>>> threads that handle syncs for the WAL. I'm generally a
>>> fail-fast-and-loud advocate, so I like aborting when things look
>>> wonky. I've also had to deal with a lot of pain around silent and thus
>>> hard to see replication failures, so strong signals that the
>>> replication system is in a bad way sound good to me atm.
>>>
>>> Do we have worker threads that we can't safely continue without
>>> indefinitely? Can we solve the general problem of "unhandled exception
>>> in threads cause a RS Abort"?
>>>
>>> As mentioned on the jira, I do worry a bit about cluster stability and
>>> cascading failures, given the ability to have user-provided endpoints
>>> in the replication process. Ultimately, I don't see that as different
>>> than all the other places coprocessors can put the cluster at risk.
>>>
>>>> On Thu, Jan 26, 2017 at 2:48 PM, Sean Busbey <bus...@apache.org> wrote:
>>>>
>>>> (edited subject to ensure folks filtering for DISCUSS see this)
>>>>
>>>> On Thu, Jan 26, 2017 at 1:58 PM, Gary Helmling <ghelml...@gmail.com>
>>>> wrote:
>>>>
>>>>> Over in HBASE-17381 there has been some discussion around whether an
>>>>> unhandled exception in a ReplicationSourceWorkerThread should trigger a
>>>>> regionserver abort.
>>>>>
>>>>> The current behavior in the case of an unexpected exception in
>>>>> ReplicationSourceWorkerThread.run() is to log a message and simply let
>>>>> the thread die, allowing replication for this source to back up.
>>>>>
>>>>> I've seen this happen in an OOME scenario, which seems like a clear case
>>>>> where we would be better off aborting the regionserver.
>>>>>
>>>>> However, in the case of any other unexpected exceptions out of the run()
>>>>> method, how do we want to handle this?
>>>>>
>>>>> I'm of the general opinion that we would be better off aborting on
>>>>> all unexpected exceptions, as it means that:
>>>>> a) we have some missing error handling
>>>>> b) failing fast raises visibility and makes it easier to add any error
>>>>> handling that should be there
>>>>> c) silently stopping up replication creates problems that are difficult
>>>>> for our users to identify operationally and hard to troubleshoot.
>>>>>
>>>>> However, the current behavior has been there for quite a while, and
>>>>> maybe there are other situations or concerns I'm not seeing which would
>>>>> justify having regionserver stability over replication stability.
>>>>>
>>>>> What are folks' thoughts on this? Should the regionserver abort on all
>>>>> unexpected exceptions in the run method or should we more narrowly focus
>>>>> this on OOMEs?
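
For anyone skimming the thread, here is a rough sketch of the pattern being discussed: attach an UncaughtExceptionHandler to a worker thread that is started on demand, so that an unexpected exception turns into an abort request instead of a silently dead thread. This is not the HBase code. The `Abortable` interface below is a local stand-in that only assumes an `abort(why, cause)` style callback, and `startWorker` and the thread name are hypothetical names made up for illustration.

```java
import java.util.concurrent.atomic.AtomicReference;

public class AbortOnUncaught {

    // Local stand-in for an "abortable" server (an assumption for this sketch,
    // not the real HBase interface).
    interface Abortable {
        void abort(String why, Throwable cause);
    }

    // Hypothetical helper: start a worker thread with a handler that asks the
    // server to abort if the thread dies from an unexpected exception or error.
    static Thread startWorker(String name, Runnable body, Abortable server) {
        Thread t = new Thread(body, name);
        t.setUncaughtExceptionHandler((thread, e) ->
            server.abort("Unexpected exception in " + thread.getName(), e));
        t.setDaemon(true);
        t.start();
        return t;
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicReference<String> reason = new AtomicReference<>();

        // Fake server that just records the abort request.
        Abortable fakeServer = (why, cause) -> reason.set(why + ": " + cause);

        // Worker that fails immediately, standing in for a run() method that
        // hits an exception nobody planned for.
        Thread worker = startWorker(
            "replication-source-worker",
            () -> { throw new IllegalStateException("boom"); },
            fakeServer);

        // The handler runs on the dying thread before it terminates, so the
        // abort request is visible once join() returns.
        worker.join();
        System.out.println("abort requested: " + reason.get());
    }
}
```

The handler sees any Throwable that escapes run(), so the OOME case and the "any other unexpected exception" case go through the same path. Narrowing the behavior to OOMEs only would mean checking the throwable type inside the handler.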