[
https://issues.apache.org/jira/browse/HBASE-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jean-Daniel Cryans resolved HBASE-3041.
---------------------------------------
Resolution: Fixed
Hadoop Flags: [Reviewed]
Committed to trunk.
> [replication] ReplicationSink shouldn't kill the whole RS when it fails to
> replicate
> ------------------------------------------------------------------------------------
>
> Key: HBASE-3041
> URL: https://issues.apache.org/jira/browse/HBASE-3041
> Project: HBase
> Issue Type: Bug
> Affects Versions: 0.89.20100924
> Reporter: Jean-Daniel Cryans
> Assignee: Jean-Daniel Cryans
> Fix For: 0.90.0
>
> Attachments: HBASE-3041.patch
>
>
> This is kind of a funny bug, as long as you don't run into it. I thought it'd
> be a good idea to kill the region servers that act as sinks when they can't
> replicate edits on their own cluster (killing the process is often what we do
> in the face of fatal errors throughout the code), but it turns out not to be.
> So, last Friday I was using CopyTable to copy data from a master to a slave
> cluster while new data was also being replicated. One table got really slow
> and took too long to split, which tripped a RetriesExhaustedException coming
> out of HTable in ReplicationSink. This killed a first region server, which
> was itself hosting regions. Splitting its logs took a while since the
> cluster was under a high insert load, and that triggered the same exception
> in the other region servers, to the point where they were all down. I
> restarted the cluster; the master split the remaining logs and began
> assigning regions. Some regions took too long to open because each region
> server had a few regions to recover, and the last ones in the queue were
> minutes away from being opened. Since the master cluster was already pushing
> edits to the slave, the region servers all got RetriesExhaustedException and
> went down again. I changed the client pause from 1 to 3 and restarted; the
> same thing happened. I changed it to 5, and was finally able to keep the
> cluster up. Fortunately, the master cluster was queuing up the HLogs, so we
> didn't lose any data and the backlog was replicated in a few minutes.
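> As an aside, a minimal sketch of bumping that pause, assuming "client pause"
> above refers to the hbase.client.pause setting (the millisecond wait between
> client retries) and that the values above are seconds; the class name is
> made up for illustration:
> {code}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hbase.HBaseConfiguration;
>
> public class ClientPauseExample {
>   public static void main(String[] args) {
>     Configuration conf = HBaseConfiguration.create();
>     // Assumption: hbase.client.pause is in milliseconds, default 1000 (1s);
>     // raising it to 3s gives slow regions more time to come online.
>     conf.setLong("hbase.client.pause", 3000L);
>   }
> }
> {code}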
> So, instead of killing the region server, any exception coming out of HTable
> should just be treated as a failure to apply the edits, and the source
> cluster should retry later.
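> A minimal sketch of that intended behavior, assuming a replicateEntries-style
> method in ReplicationSink (names like applyEdits are hypothetical, not the
> actual patch):
> {code}
> import java.io.IOException;
>
> public class ReplicationSinkSketch {
>   /**
>    * Apply a batch of replicated edits. Any failure is surfaced to the
>    * caller instead of aborting the region server; the source cluster
>    * keeps its HLogs queued and retries the batch later.
>    */
>   public void replicateEntries(Runnable applyEdits) throws IOException {
>     try {
>       applyEdits.run(); // stands in for the HTable puts/deletes
>     } catch (RuntimeException ex) {
>       // Before this fix: the sink aborted the whole region server here.
>       throw new IOException("Failed to apply replicated edits, " +
>           "the source cluster will retry", ex);
>     }
>   }
> }
> {code}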
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.