[ https://issues.apache.org/jira/browse/HBASE-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833227#action_12833227 ]
ryan rawson commented on HBASE-2223:
------------------------------------

So if we lose a replication stream, it is an uneven break; at this point we can't say 'we are missing all edits from TS=X to TS=Y' and kick off a map-reduce job to read them over.

The central question is: do we want to avoid duplicate KeyValues as much as possible? I say yes, because duplicates mess with the version checking and are in general sloppy. Also, edits don't pile up that quickly on mainline serving systems... so in reality we aren't talking about a 50TB log storage requirement.

We should probably be tracking the status of all logfiles in zookeeper so we know who needs what and when.

> Handle 10min+ network partitions between clusters
> -------------------------------------------------
>
>                 Key: HBASE-2223
>                 URL: https://issues.apache.org/jira/browse/HBASE-2223
>             Project: Hadoop HBase
>          Issue Type: Sub-task
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>             Fix For: 0.21.0
>
>
> We need a nice way of handling long network partitions without impacting a
> master cluster (which pushes the data). Currently it will just retry over and
> over again.
> I think we could:
>  - Stop replication to a slave cluster if it didn't respond for more than 10
> minutes
>  - Keep track of the duration of the partition
>  - When the slave cluster comes back, initiate a MR job like HBASE-2221
> Maybe we want less than 10 minutes, maybe we want this to be all automatic or
> just the first 2 parts. Discuss.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
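
For illustration only, the first two steps proposed in the issue description (stop replicating after a slave has been silent for more than 10 minutes, and keep track of how long the partition lasted) could be sketched roughly as below. This is not HBase code: the class `SlaveReplicationState`, its methods, and the injectable `clock` are all hypothetical names invented for the sketch. Note also that the window it reports is wall-clock time as seen by the master; as the comment above points out, that does not necessarily translate into a clean "TS=X to TS=Y" range of missing edits.

```python
import time

# The "10 minutes" from the issue description; the issue itself leaves the value open.
PARTITION_TIMEOUT_SECONDS = 10 * 60


class SlaveReplicationState:
    """Hypothetical per-slave tracker: stops replication after a long silence
    and reports the partition window once the slave responds again."""

    def __init__(self, timeout=PARTITION_TIMEOUT_SECONDS, clock=time.monotonic):
        self._timeout = timeout
        self._clock = clock
        self._last_ack = clock()   # last time the slave cluster responded
        self.replicating = True

    def on_failure(self):
        """Call after a failed push. Stops replication once the slave has been
        silent for longer than the timeout. Returns whether to keep retrying."""
        if self.replicating and self._clock() - self._last_ack > self._timeout:
            self.replicating = False
        return self.replicating

    def on_success(self):
        """Call when the slave responds. If replication had been stopped,
        resume it and return the partition window (start, end); else None."""
        now = self._clock()
        window = None
        if not self.replicating:
            window = (self._last_ack, now)
            self.replicating = True
        self._last_ack = now
        return window
```

A caller would invoke `on_failure()` after each failed push and stop retrying once it returns `False`. A non-`None` result from `on_success()` is the point at which the third step (a catch-up job along the lines of HBASE-2221) would be considered.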