[replication] ReplicationSink shouldn't kill the whole RS when it fails to 
replicate
------------------------------------------------------------------------------------

                 Key: HBASE-3041
                 URL: https://issues.apache.org/jira/browse/HBASE-3041
             Project: HBase
          Issue Type: Bug
    Affects Versions: 0.89.20100924
            Reporter: Jean-Daniel Cryans
            Assignee: Jean-Daniel Cryans
             Fix For: 0.90.0


This is kind of a funny bug, as long as you don't run into it. I thought it'd 
be a good idea to kill the region servers that act as sinks when they can't 
replicate edits on their own cluster (this is something we often do in the face 
of fatal errors throughout the code), but it turns out it isn't.

So, last Friday I was using CopyTable to copy data from a master to a slave 
cluster while new data was being replicated. One table got really slow and 
took too long to split, which tripped a RetriesExhaustedException coming out 
of HTable in ReplicationSink. This killed a first region server, which was 
itself hosting regions. Splitting the logs took a bit longer since the cluster 
was under high insert load, so this triggered other exceptions in the other 
region servers, to the point where they were all down. I restarted the cluster, 
the master split all the remaining logs and began assigning regions. Some of 
them took too long to open because each region server had a few regions to 
recover and the last ones in the queue were minutes from being opened. Since 
the master cluster was already pushing edits to the slave, the region servers 
all got RetriesExhausted and all went down again. I changed the client pause 
from 1 to 3 and restarted, and the same thing happened. I changed it to 5 and 
was finally able to keep the cluster up. Fortunately, the master cluster was 
queueing up the HLogs so we didn't lose any data and the backlog was 
replicated in a few minutes.

So, instead of killing the region server, any exception coming out of HTable 
should just be treated as a failure to apply and the source cluster should 
retry later.
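
To make the proposal concrete, here is a rough sketch of the intended 
behavior. It is not the actual ReplicationSink code: EditApplier, SinkSketch, 
SourceSketch and shipWithRetry are hypothetical names made up for 
illustration, and plain strings stand in for real edits. The sink turns an 
exception into a "not applied" result instead of aborting, and the source 
simply tries again.

```java
import java.io.IOException;
import java.util.List;

/** Hypothetical stand-in for the table client the sink writes through. */
interface EditApplier {
    void apply(List<String> edits) throws IOException;
}

/** Hypothetical sink: reports a failure to apply instead of aborting the server. */
class SinkSketch {
    private final EditApplier applier;

    SinkSketch(EditApplier applier) {
        this.applier = applier;
    }

    /** Returns true if the edits were applied, false if the source should retry later. */
    boolean replicateEntries(List<String> edits) {
        try {
            applier.apply(edits);
            return true;
        } catch (IOException e) {
            // No abort here: the failure is only reported back to the caller.
            return false;
        }
    }
}

/** Hypothetical source side: keeps the edits and retries until the sink accepts them. */
class SourceSketch {
    /** Returns the number of attempts used, or -1 if every attempt failed. */
    static int shipWithRetry(SinkSketch sink, List<String> edits, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (sink.replicateEntries(edits)) {
                return attempt;
            }
            // A real source would sleep here before retrying; omitted in this sketch.
        }
        return -1;
    }
}
```

In this sketch a slow or unavailable table on the slave only delays 
replication. The source keeps its backlog and the sink's region server stays 
up.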

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.