Replication resiliency

Gary Helmling Thu, 26 Jan 2017 11:59:22 -0800

Over in HBASE-17381 there has been some discussion around whether an
unhandled exception in a ReplicationSourceWorkerThread should trigger a
regionserver abort.

The current behavior in the case of an unexpected exception in
ReplicationSourceWorkerThread.run() is to log a message and simply let the
thread die, allowing replication for this source to back up.

I've seen this happen in an OOME scenario, which seems like a clear case
where we would be better off aborting the regionserver.

However, in the case of any other unexpected exceptions out of the run()
method, how do we want to handle this?

I'm of the general opinion that we would be better off aborting on
all unexpected exceptions, because:
a) an unexpected exception means we have some missing error handling
b) failing fast raises visibility and makes it easier to add any error
handling that should be there
c) silently stalling replication creates problems that are difficult for
our users to identify operationally and hard to troubleshoot.
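
To make the fail-fast option concrete, here is a rough sketch in plain Java.
It is not HBase code: AbortHook is a hypothetical stand-in for whatever
mechanism would stop the regionserver, and the worker body just simulates a
bug. The only real API used is Thread.setUncaughtExceptionHandler.

```java
import java.util.concurrent.atomic.AtomicReference;

public class FailFastWorkerSketch {
    /** Hypothetical stand-in for the hook that would abort the server. */
    interface AbortHook {
        void abort(String why, Throwable cause);
    }

    /** Starts a worker whose uncaught exceptions are routed to the abort hook. */
    static Thread startWorker(String name, Runnable body, AbortHook hook) {
        Thread t = new Thread(body, name);
        // Without a handler the thread would simply die after logging;
        // here the failure is escalated instead of being left unnoticed.
        t.setUncaughtExceptionHandler((thread, e) ->
            hook.abort("Unexpected exception in " + thread.getName(), e));
        t.start();
        return t;
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicReference<String> reason = new AtomicReference<>();
        Thread t = startWorker(
            "replication-worker-sketch",
            () -> { throw new IllegalStateException("simulated bug"); },
            (why, cause) -> reason.set(why + ": " + cause));
        t.join(); // the handler runs on the dying thread before join() returns
        System.out.println(reason.get());
    }
}
```

In this sketch the hook only records the reason; in a real server it would
be the point where the abort is actually triggered.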

However, the current behavior has been there for quite a while, and maybe
there are other situations or concerns I'm not seeing which would justify
favoring regionserver stability over replication stability.

What are folks' thoughts on this?  Should the regionserver abort on all
unexpected exceptions in the run method, or should we more narrowly focus
this on OOMEs?
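
The two options in that question can be written down as a small decision
function. This is only an illustrative sketch: the Policy enum and
shouldAbort are hypothetical names, not existing HBase classes or
configuration, and it only inspects the top-level throwable.

```java
public class AbortPolicySketch {
    /** Hypothetical policy switch covering the two options under discussion. */
    enum Policy { ABORT_ON_ANY_UNEXPECTED, ABORT_ON_OOME_ONLY }

    /** Decides whether an exception escaping run() should abort the server. */
    static boolean shouldAbort(Throwable t, Policy policy) {
        if (t instanceof OutOfMemoryError) {
            return true; // both options agree that an OOME aborts
        }
        // Anything else aborts only under the broader policy.
        return policy == Policy.ABORT_ON_ANY_UNEXPECTED;
    }

    public static void main(String[] args) {
        Throwable bug = new IllegalStateException("simulated bug");
        Throwable oome = new OutOfMemoryError("simulated OOME");
        System.out.println(shouldAbort(bug, Policy.ABORT_ON_ANY_UNEXPECTED));
        System.out.println(shouldAbort(bug, Policy.ABORT_ON_OOME_ONLY));
        System.out.println(shouldAbort(oome, Policy.ABORT_ON_OOME_ONLY));
    }
}
```

Under the narrower policy, the simulated bug would still leave the worker
thread dead and replication for that source backing up, which is the
trade-off the question is about.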
