[
https://issues.apache.org/jira/browse/HBASE-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jean-Daniel Cryans resolved HBASE-3041.
---------------------------------------
Resolution: Fixed
Hadoop Flags: [Reviewed]
Committed to trunk.
> [replication] ReplicationSink shouldn't kill the whole RS when it fails to
> replicate
> ------------------------------------------------------------------------------------
>
> Key: HBASE-3041
> URL: https://issues.apache.org/jira/browse/HBASE-3041
> Project: HBase
> Issue Type: Bug
> Affects Versions: 0.89.20100924
> Reporter: Jean-Daniel Cryans
> Assignee: Jean-Daniel Cryans
> Fix For: 0.90.0
>
> Attachments: HBASE-3041.patch
>
>
> This is kind of a funny bug, as long as you don't run into it. I thought it'd
> be a good idea to kill the region servers that act as sinks when they can't
> replicate edits on their own cluster (killing the process is often what we do
> in the face of fatal errors throughout the code), but it turns out not to be.
> So, last Friday I was using CopyTable to copy data from a master to a slave
> cluster while new data was also being replicated. One table got really slow
> and took too long to split, which tripped a RetriesExhaustedException coming
> out of HTable in ReplicationSink. This killed a first region server, which
> was itself hosting regions. Splitting its logs took a while since the
> cluster was under a high insert load, and that triggered the same exception
> in the other region servers, to the point where they were all down. I
> restarted the cluster; the master split the remaining logs and began
> assigning regions. Some regions took too long to open because each region
> server had a few regions to recover, and the last ones in the queue were
> minutes away from being opened. Since the master cluster was already pushing
> edits to the slave, the region servers all got RetriesExhaustedException and
> went down again. I changed the client pause from 1 to 3 and restarted; the
> same thing happened. I changed it to 5, and was finally able to keep the
> cluster up. Fortunately, the master cluster was queuing up the HLogs, so we
> didn't lose any data and the backlog was replicated in a few minutes.
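> As an aside, a minimal sketch of bumping that pause, assuming "client pause"
> above refers to the hbase.client.pause setting (the millisecond wait between
> client retries) and that the values above are seconds; the class name is
> made up for illustration:
> {code}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hbase.HBaseConfiguration;
>
> public class ClientPauseExample {
>   public static void main(String[] args) {
>     Configuration conf = HBaseConfiguration.create();
>     // Assumption: hbase.client.pause is in milliseconds, default 1000 (1s);
>     // raising it to 3s gives slow regions more time to come online.
>     conf.setLong("hbase.client.pause", 3000L);
>   }
> }
> {code}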
> So, instead of killing the region server, any exception coming out of HTable
> should just be treated as a failure to apply the edits, and the source
> cluster should retry later.
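> A minimal sketch of that intended behavior, assuming a replicateEntries-style
> method in ReplicationSink (names like applyEdits are hypothetical, not the
> actual patch):
> {code}
> import java.io.IOException;
>
> public class ReplicationSinkSketch {
>   /**
>    * Apply a batch of replicated edits. Any failure is surfaced to the
>    * caller instead of aborting the region server; the source cluster
>    * keeps its HLogs queued and retries the batch later.
>    */
>   public void replicateEntries(Runnable applyEdits) throws IOException {
>     try {
>       applyEdits.run(); // stands in for the HTable puts/deletes
>     } catch (RuntimeException ex) {
>       // Before this fix: the sink aborted the whole region server here.
>       throw new IOException("Failed to apply replicated edits, " +
>           "the source cluster will retry", ex);
>     }
>   }
> }
> {code}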
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.