Re: [DISCUSS] Re: Replication resiliency

Sean Busbey Fri, 27 Jan 2017 13:04:11 -0800

Josh, probably worth checking if a grep or something else we can do in
precommit could catch this.




On Fri, Jan 27, 2017 at 1:26 PM, Josh Elser <els...@apache.org> wrote:
> Cool.
>
> Let me open an issue to scan the codebase to see if we can find any
> instances where we are starting threads which aren't using the UEH.
>
>
> Andrew Purtell wrote:
>>
>> Agreed, let's abort with an abundance of caution. That is the _least_ that
>> should be done when a thread dies unexpectedly. Maybe we can improve
>> resiliency for specific cases later.
>>
>>
>> On Jan 26, 2017, at 5:53 PM, Enis Söztutar<enis....@gmail.com>  wrote:
>>
>>>> Do we have worker threads that we can't safely continue without
>>>
>>> indefinitely? Can we solve the general problem of "unhandled exception
>>> in threads cause a RS Abort"?
>>> We have this already in the code base. We are injecting an
>>> UncaughtExceptionhandler (which is calling Abortable.abort) to almost all
>>> of the HRegionServer service threads (see HRS.startServiceThreads). But
>>> I've also seen cases where some important thread got unmarked. I think it
>>> is good idea to revisit that and make sure that all the threads are
>>> injected with the UEH.
>>>
>>> The replication source threads are started on demand, that is why the UEH
>>> is not injected I think. But agreed that we should do the safe route
>>> here,
>>> and abort the regionserver.
>>>
>>> Enis
>>>
>>>> On Thu, Jan 26, 2017 at 2:19 PM, Josh Elser<els...@apache.org>  wrote:
>>>>
>>>> +1 If any "worker" thread can't safely/reasonably retry some unexpected
>>>> exception without a reasonable expectation of self-healing, tank the RS.
>>>>
>>>> Having those threads die but not the RS could go uncaught for indefinite
>>>> period of time.
>>>>
>>>>
>>>> Sean Busbey wrote:
>>>>
>>>>> I've noticed a few other places where we can lose a worker thread and
>>>>> the RegionServer happily continues. One notable example is the worker
>>>>> threads that handle syncs for the WAL. I'm generally a
>>>>> fail-fast-and-loud advocate, so I like aborting when things look
>>>>> wonky. I've also had to deal with a lot of pain around silent and thus
>>>>> hard to see replication failures, so strong signals that the
>>>>> replication system is in a bad way sound good to me atm.
>>>>>
>>>>> Do we have worker threads that we can't safely continue without
>>>>> indefinitely? Can we solve the general problem of "unhandled exception
>>>>> in threads cause a RS Abort"?
>>>>>
>>>>> As mentioned on the jira, I do worry a bit about cluster stability and
>>>>> cascading failures, given the ability to have user-provided endpoints
>>>>> in the replication process. Ultimately, I don't see that as different
>>>>> than all the other places coprocessors can put the cluster at risk.
>>>>>
>>>>>> On Thu, Jan 26, 2017 at 2:48 PM, Sean Busbey<bus...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>> (edited subject to ensure folks filtering for DISCUSS see this)
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Jan 26, 2017 at 1:58 PM, Gary Helmling<ghelml...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Over in HBASE-17381 there has been some discussion around whether an
>>>>>>> unhandled exception in a ReplicationSourceWorkerThread should trigger
>>>>>>> a
>>>>>>> regionserver abort.
>>>>>>>
>>>>>>> The current behavior in the case of an unexpected exception in
>>>>>>> ReplicationSourceWorkerThread.run() is to log a message and simply
>>>>>>> let
>>>>>>> the
>>>>>>> thread die, allowing replication for this source to back up.
>>>>>>>
>>>>>>> I've seen this happen in an OOME scenario, which seems like a clear
>>>>>>> case
>>>>>>> where we would be better off aborting the regionserver.
>>>>>>>
>>>>>>> However, in the case of any other unexpected exceptions out of the
>>>>>>> run()
>>>>>>> method, how do we want to handle this?
>>>>>>>
>>>>>>> I'm of the general opinion that where we would be better off aborting
>>>>>>> on
>>>>>>> all unexpected exceptions, as it means that:
>>>>>>> a) we have some missing error handling
>>>>>>> b) failing fast raises visibility and makes it easier to add any
>>>>>>> error
>>>>>>> handling that should be there
>>>>>>> c) silently stopping up replication creates problems that are
>>>>>>> difficult
>>>>>>> for
>>>>>>> our users to identify operationally and hard to troubleshoot.
>>>>>>>
>>>>>>> However, the current behavior has been there for quite a while, and
>>>>>>> maybe
>>>>>>> there are other situations or concerns I'm not seeing which would
>>>>>>> justify
>>>>>>> having regionserver stability over replication stability.
>>>>>>>
>>>>>>> What are folks thoughts on this?  Should the regionserver abort on
>>>>>>> all
>>>>>>> unexpected exceptions in the run method or should we more narrowly
>>>>>>> focus
>>>>>>> this on OOME's?
>>>>>>>
>

Re: [DISCUSS] Re: Replication resiliency

Reply via email to