[jira] Commented: (HBASE-2223) Handle 10min+ network partitions between clusters

Jean-Daniel Cryans (JIRA) Fri, 12 Feb 2010 14:08:55 -0800

    [ https://issues.apache.org/jira/browse/HBASE-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833192#action_12833192 ]


Jean-Daniel Cryans commented on HBASE-2223:
-------------------------------------------

bq. Or just restart replication after the slave is back online and interleave 
edits from the queue with new ones as necessary?

One thing I forgot to add is that the job would be configured to only process 
edits with timestamps newer than x.

So the problem with resending those edits is something I tackled in HBASE-2197. 
If one cluster falls far behind, say by 2 hours, we have to decide where we are 
going to get that data from. One option is to use the old log files as well as 
the log files that are currently on the region servers. It ain't so bad, but 
what happens in the case of failure? In 2197, the first solution I described 
involves using a distributed queue where all RS would participate in processing 
each log file and interleaving its edits with the rest of the stream.
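
To make the interleaving step concrete, here is a minimal sketch of merging 
edits from an old log file with edits from the live stream in timestamp order. 
It is an illustration only, not the HBASE-2197 design: the Edit type, its 
fields, and the interleave() helper are hypothetical names, and each input is 
assumed to be already sorted by timestamp.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple


class Edit(NamedTuple):
    """Hypothetical stand-in for one replicated log entry."""
    timestamp: int  # milliseconds since the epoch
    row: str
    value: str


def interleave(*streams: Iterable[Edit]) -> Iterator[Edit]:
    """Lazily merge timestamp-sorted edit streams into one sorted stream.

    Assumes every input stream is already ordered by timestamp.
    """
    return heapq.merge(*streams, key=lambda e: e.timestamp)


# Edits recovered from an old log file and edits from the current stream.
old_log = [Edit(100, "r1", "a"), Edit(300, "r2", "c")]
live_log = [Edit(200, "r1", "b"), Edit(400, "r3", "d")]

merged = list(interleave(old_log, live_log))
```

Because heapq.merge is lazy, only one pending edit per stream is held in 
memory at a time, which matters if the old log files are large.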

Another option is keeping yet another set of log files, separate from the 
"normal" ones, to which we flush log entries if some cluster falls behind. Then 
if a region server dies, we process both sets of log files.

> Handle 10min+ network partitions between clusters
> -------------------------------------------------
>
>                 Key: HBASE-2223
>                 URL: https://issues.apache.org/jira/browse/HBASE-2223
>             Project: Hadoop HBase
>          Issue Type: Sub-task
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>             Fix For: 0.21.0
>
>
> We need a nice way of handling long network partitions without impacting a 
> master cluster (which pushes the data). Currently it will just retry over and 
> over again.
> I think we could:
>  - Stop replication to a slave cluster if it didn't respond for more than 10 
> minutes
>  - Keep track of the duration of the partition
>  - When the slave cluster comes back, initiate a MR job like HBASE-2221 
> Maybe we want less than 10 minutes, maybe we want this to be all automatic or 
> just the first 2 parts. Discuss.
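
The three steps proposed in the description could be sketched roughly as 
follows. This is a simplified model under stated assumptions, not HBase code: 
SlaveTracker and its methods are hypothetical names, time is passed in 
explicitly as seconds, and the 10-minute threshold is the value floated in the 
issue, which is still open for discussion.

```python
from typing import Optional, Tuple

# Threshold suggested in the issue description; an assumption, not a decision.
PARTITION_THRESHOLD_S = 10 * 60


class SlaveTracker:
    """Hypothetical tracker for one slave cluster's replication state."""

    def __init__(self, now: float, threshold_s: float = PARTITION_THRESHOLD_S):
        self.threshold_s = threshold_s
        self.last_response = now
        # Set to the time of the last good response once replication stops.
        self.partition_start: Optional[float] = None

    def on_failure(self, now: float) -> bool:
        """Record a failed attempt; return True if we should keep retrying."""
        silent_for = now - self.last_response
        if self.partition_start is None and silent_for > self.threshold_s:
            # Step 1: stop replicating. Step 2: remember when the gap began.
            self.partition_start = self.last_response
        return self.partition_start is None

    def on_response(self, now: float) -> Optional[Tuple[float, float]]:
        """Record a good response.

        Returns the (start, end) window a catch-up job would need to cover
        if a partition just ended, otherwise None.
        """
        window = None
        if self.partition_start is not None:
            # Step 3: the slave is back, so report the window to replay.
            window = (self.partition_start, now)
            self.partition_start = None
        self.last_response = now
        return window


tracker = SlaveTracker(now=0.0)
still_retrying = tracker.on_failure(now=300.0)   # 5 minutes of silence
stopped = not tracker.on_failure(now=601.0)      # past the threshold
catch_up = tracker.on_response(now=7200.0)       # slave returns after 2 hours
```

The start of the returned window is the time of the last good response, which 
is one possible choice for the "x" that the catch-up job would use as its 
lower timestamp bound.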

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
