Josh, probably worth checking if a grep or something else we can do in precommit could catch this.
On Fri, Jan 27, 2017 at 1:26 PM, Josh Elser <els...@apache.org> wrote: > Cool. > > Let me open an issue to scan the codebase to see if we can find any > instances where we are starting threads which aren't using the UEH. > > > Andrew Purtell wrote: >> >> Agreed, let's abort with an abundance of caution. That is the _least_ that >> should be done when a thread dies unexpectedly. Maybe we can improve >> resiliency for specific cases later. >> >> >> On Jan 26, 2017, at 5:53 PM, Enis Söztutar<enis....@gmail.com> wrote: >> >>>> Do we have worker threads that we can't safely continue without >>> >>> indefinitely? Can we solve the general problem of "unhandled exception >>> in threads cause a RS Abort"? >>> We have this already in the code base. We are injecting an >>> UncaughtExceptionhandler (which is calling Abortable.abort) to almost all >>> of the HRegionServer service threads (see HRS.startServiceThreads). But >>> I've also seen cases where some important thread got unmarked. I think it >>> is good idea to revisit that and make sure that all the threads are >>> injected with the UEH. >>> >>> The replication source threads are started on demand, that is why the UEH >>> is not injected I think. But agreed that we should do the safe route >>> here, >>> and abort the regionserver. >>> >>> Enis >>> >>>> On Thu, Jan 26, 2017 at 2:19 PM, Josh Elser<els...@apache.org> wrote: >>>> >>>> +1 If any "worker" thread can't safely/reasonably retry some unexpected >>>> exception without a reasonable expectation of self-healing, tank the RS. >>>> >>>> Having those threads die but not the RS could go uncaught for indefinite >>>> period of time. >>>> >>>> >>>> Sean Busbey wrote: >>>> >>>>> I've noticed a few other places where we can lose a worker thread and >>>>> the RegionServer happily continues. One notable example is the worker >>>>> threads that handle syncs for the WAL. I'm generally a >>>>> fail-fast-and-loud advocate, so I like aborting when things look >>>>> wonky. I've also had to deal with a lot of pain around silent and thus >>>>> hard to see replication failures, so strong signals that the >>>>> replication system is in a bad way sound good to me atm. >>>>> >>>>> Do we have worker threads that we can't safely continue without >>>>> indefinitely? Can we solve the general problem of "unhandled exception >>>>> in threads cause a RS Abort"? >>>>> >>>>> As mentioned on the jira, I do worry a bit about cluster stability and >>>>> cascading failures, given the ability to have user-provided endpoints >>>>> in the replication process. Ultimately, I don't see that as different >>>>> than all the other places coprocessors can put the cluster at risk. >>>>> >>>>>> On Thu, Jan 26, 2017 at 2:48 PM, Sean Busbey<bus...@apache.org> >>>>>> wrote: >>>>>> >>>>>> (edited subject to ensure folks filtering for DISCUSS see this) >>>>>> >>>>>> >>>>>> >>>>>> On Thu, Jan 26, 2017 at 1:58 PM, Gary Helmling<ghelml...@gmail.com> >>>>>> wrote: >>>>>> >>>>>>> Over in HBASE-17381 there has been some discussion around whether an >>>>>>> unhandled exception in a ReplicationSourceWorkerThread should trigger >>>>>>> a >>>>>>> regionserver abort. >>>>>>> >>>>>>> The current behavior in the case of an unexpected exception in >>>>>>> ReplicationSourceWorkerThread.run() is to log a message and simply >>>>>>> let >>>>>>> the >>>>>>> thread die, allowing replication for this source to back up. >>>>>>> >>>>>>> I've seen this happen in an OOME scenario, which seems like a clear >>>>>>> case >>>>>>> where we would be better off aborting the regionserver. >>>>>>> >>>>>>> However, in the case of any other unexpected exceptions out of the >>>>>>> run() >>>>>>> method, how do we want to handle this? >>>>>>> >>>>>>> I'm of the general opinion that where we would be better off aborting >>>>>>> on >>>>>>> all unexpected exceptions, as it means that: >>>>>>> a) we have some missing error handling >>>>>>> b) failing fast raises visibility and makes it easier to add any >>>>>>> error >>>>>>> handling that should be there >>>>>>> c) silently stopping up replication creates problems that are >>>>>>> difficult >>>>>>> for >>>>>>> our users to identify operationally and hard to troubleshoot. >>>>>>> >>>>>>> However, the current behavior has been there for quite a while, and >>>>>>> maybe >>>>>>> there are other situations or concerns I'm not seeing which would >>>>>>> justify >>>>>>> having regionserver stability over replication stability. >>>>>>> >>>>>>> What are folks thoughts on this? Should the regionserver abort on >>>>>>> all >>>>>>> unexpected exceptions in the run method or should we more narrowly >>>>>>> focus >>>>>>> this on OOME's? >>>>>>> >