Re: [DISCUSS] Re: Replication resiliency

Andrew Purtell Thu, 26 Jan 2017 21:04:01 -0800

Agreed, let's abort with an abundance of caution. That is the _least_ that 
should be done when a thread dies unexpectedly. Maybe we can improve resiliency 
for specific cases later.



On Jan 26, 2017, at 5:53 PM, Enis Söztutar <enis....@gmail.com> wrote:

>> Do we have worker threads that we can't safely continue without
> indefinitely? Can we solve the general problem of "unhandled exception
> in threads cause a RS Abort"?
> We have this already in the code base. We are injecting an
> UncaughtExceptionhandler (which is calling Abortable.abort) to almost all
> of the HRegionServer service threads (see HRS.startServiceThreads). But
> I've also seen cases where some important thread got unmarked. I think it
> is good idea to revisit that and make sure that all the threads are
> injected with the UEH.
> 
> The replication source threads are started on demand, that is why the UEH
> is not injected I think. But agreed that we should do the safe route here,
> and abort the regionserver.
> 
> Enis
> 
>> On Thu, Jan 26, 2017 at 2:19 PM, Josh Elser <els...@apache.org> wrote:
>> 
>> +1 If any "worker" thread can't safely/reasonably retry some unexpected
>> exception without a reasonable expectation of self-healing, tank the RS.
>> 
>> Having those threads die but not the RS could go uncaught for indefinite
>> period of time.
>> 
>> 
>> Sean Busbey wrote:
>> 
>>> I've noticed a few other places where we can lose a worker thread and
>>> the RegionServer happily continues. One notable example is the worker
>>> threads that handle syncs for the WAL. I'm generally a
>>> fail-fast-and-loud advocate, so I like aborting when things look
>>> wonky. I've also had to deal with a lot of pain around silent and thus
>>> hard to see replication failures, so strong signals that the
>>> replication system is in a bad way sound good to me atm.
>>> 
>>> Do we have worker threads that we can't safely continue without
>>> indefinitely? Can we solve the general problem of "unhandled exception
>>> in threads cause a RS Abort"?
>>> 
>>> As mentioned on the jira, I do worry a bit about cluster stability and
>>> cascading failures, given the ability to have user-provided endpoints
>>> in the replication process. Ultimately, I don't see that as different
>>> than all the other places coprocessors can put the cluster at risk.
>>> 
>>>> On Thu, Jan 26, 2017 at 2:48 PM, Sean Busbey<bus...@apache.org>  wrote:
>>>> 
>>>> (edited subject to ensure folks filtering for DISCUSS see this)
>>>> 
>>>> 
>>>> 
>>>> On Thu, Jan 26, 2017 at 1:58 PM, Gary Helmling<ghelml...@gmail.com>
>>>> wrote:
>>>> 
>>>>> Over in HBASE-17381 there has been some discussion around whether an
>>>>> unhandled exception in a ReplicationSourceWorkerThread should trigger a
>>>>> regionserver abort.
>>>>> 
>>>>> The current behavior in the case of an unexpected exception in
>>>>> ReplicationSourceWorkerThread.run() is to log a message and simply let
>>>>> the
>>>>> thread die, allowing replication for this source to back up.
>>>>> 
>>>>> I've seen this happen in an OOME scenario, which seems like a clear case
>>>>> where we would be better off aborting the regionserver.
>>>>> 
>>>>> However, in the case of any other unexpected exceptions out of the run()
>>>>> method, how do we want to handle this?
>>>>> 
>>>>> I'm of the general opinion that where we would be better off aborting on
>>>>> all unexpected exceptions, as it means that:
>>>>> a) we have some missing error handling
>>>>> b) failing fast raises visibility and makes it easier to add any error
>>>>> handling that should be there
>>>>> c) silently stopping up replication creates problems that are difficult
>>>>> for
>>>>> our users to identify operationally and hard to troubleshoot.
>>>>> 
>>>>> However, the current behavior has been there for quite a while, and
>>>>> maybe
>>>>> there are other situations or concerns I'm not seeing which would
>>>>> justify
>>>>> having regionserver stability over replication stability.
>>>>> 
>>>>> What are folks thoughts on this?  Should the regionserver abort on all
>>>>> unexpected exceptions in the run method or should we more narrowly focus
>>>>> this on OOME's?
>>>>> 
>>>>

Re: [DISCUSS] Re: Replication resiliency

Reply via email to