[jira] Commented: (HBASE-2223) Handle 10min+ network partitions between clusters

HBase Review Board (JIRA) Mon, 14 Jun 2010 11:44:40 -0700

    [ https://issues.apache.org/jira/browse/HBASE-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12878692#action_12878692 ]

HBase Review Board commented on HBASE-2223:
-------------------------------------------

Message from: "Jean-Daniel Cryans" <[email protected]>

bq.  On 2010-06-11 12:45:29, stack wrote:
bq.  > bin/replication/add_peer.rb, line 21
bq.  > <http://review.hbase.org/r/76/diff/5/?file=1104#file1104line21>
bq.  >
bq.  >     Should you point at some replication documentation here?  Is there such a thing?

There is a package.html later in the patch; should I point to it?

bq.  On 2010-06-11 12:45:29, stack wrote:
bq.  > bin/replication/copy_tables_desc.rb, line 58
bq.  > <http://review.hbase.org/r/76/diff/5/?file=1105#file1105line58>
bq.  >
bq.  >     This could get a bit annoying I'd say.

It helped me a lot. Remove it if people complain?

bq.  On 2010-06-11 12:45:29, stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/HConstants.java, line 342
bq.  > <http://review.hbase.org/r/76/diff/5/?file=1107#file1107line342>
bq.  >
bq.  >     This has to go here?  Can it go into one of the replication classes?

It's used by both the master and the region server, so to me it belongs there.

bq.  On 2010-06-11 12:45:29, stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/master/ServerManager.java, line 156
bq.  > <http://review.hbase.org/r/76/diff/5/?file=1109#file1109line156>
bq.  >
bq.  >     Can't you just do c.get("key", defaultvalue)?

No, I also do a check on replication.

bq.  On 2010-06-11 12:45:29, stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java, line 929
bq.  > <http://review.hbase.org/r/76/diff/5/?file=1110#file1110line929>
bq.  >
bq.  >     You writing startcode into zk?  Why not write servername -- the host+port+startcode combo?

To be coherent with the rest of the code that uses zookeeper.

bq.  On 2010-06-11 12:45:29, stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java, line 1075
bq.  > <http://review.hbase.org/r/76/diff/5/?file=1110#file1110line1075>
bq.  >
bq.  >     Is this directory name?  Confusingly named given rootdir+regLogPathStr only adds up to repLogPath.

I don't understand you, but this code is going to be removed in my next patch 
as I'm simplifying RepSink.

bq.  On 2010-06-11 12:45:29, stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeperHelper.java, line 55
bq.  > <http://review.hbase.org/r/76/diff/5/?file=1113#file1113line55>
bq.  >
bq.  >     Peers are named '1', '2'?  Can't we have more meaningful names here?

We agreed that peers are identified internally by a short, since that is what 
gets stored. We could use an external mapping of short->cute_name.

bq.  On 2010-06-11 12:45:29, stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeperHelper.java, line 59
bq.  > <http://review.hbase.org/r/76/diff/5/?file=1113#file1113line59>
bq.  >
bq.  >     Use servername instead of startcode

Same answer as before: it needs to be coherent with the rest of the zookeeper code.

bq.  On 2010-06-11 12:45:29, stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeperHelper.java, line 60
bq.  > <http://review.hbase.org/r/76/diff/5/?file=1113#file1113line60>
bq.  >
bq.  >     All RS's in a master cluster replicate?

Yep... was that an implicit way of saying that I need to document that in RZH?

bq.  On 2010-06-11 12:45:29, stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeperHelper.java, line 107
bq.  > <http://review.hbase.org/r/76/diff/5/?file=1113#file1113line107>
bq.  >
bq.  >     Should this class be called Wrapper instead of Helper?

Sure

bq.  On 2010-06-11 12:45:29, stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeperHelper.java, line 185
bq.  > <http://review.hbase.org/r/76/diff/5/?file=1113#file1113line185>
bq.  >
bq.  >     You mean 'ensemble' here rather than 'quorum' (Patrick will kill you if he sees you calling it a 'quorum' when you mean the other)

Argh, I'm trying to correct myself but I'm still missing some of them. Thx!

bq.  On 2010-06-11 12:45:29, stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeperHelper.java, line 263
bq.  > <http://review.hbase.org/r/76/diff/5/?file=1113#file1113line263>
bq.  >
bq.  >     We keep up the replication position in zk?  How much do we replicate in one go?  It's not a single edit, is it?  We do this for every log file?

Yes, the position is kept in zk. We replicate a defined amount in one go, as 
specified in ReplicationSource, so no, it's not a single edit. We do this for 
every current log file; we only replicate one at a time per region server.

bq.  On 2010-06-11 12:45:29, stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeperHelper.java, line 328
bq.  > <http://review.hbase.org/r/76/diff/5/?file=1113#file1113line328>
bq.  >
bq.  >     LOG.warn instead?
bq.  >

I'll do like the rest of the class and use LOG.error.

bq.  On 2010-06-11 12:45:29, stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeperHelper.java, line 354
bq.  > <http://review.hbase.org/r/76/diff/5/?file=1113#file1113line354>
bq.  >
bq.  >     We return empty map if clusters size is == 1?  Should that be clusters.size == 0?

That part isn't clear enough. The reason it's 1 and not 0 is that we put a lock 
in there, so the lock is listed among the znodes we fetch. Actually this should 
be <= 1 rather than == 1.
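
To make the lock point concrete, here is a minimal sketch of the check, not the actual RZH code. The class name DeadServerQueues, the method clustersToProcess and the lock znode name "lock" are all hypothetical; the only thing taken from the patch is the idea that the lock shows up in the child listing.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper for illustration; not the ReplicationZookeeperHelper code. */
final class DeadServerQueues {
    /** Assumed name of the lock znode placed under the dead region server's znode. */
    static final String LOCK_ZNODE = "lock";

    /** Returns the peer cluster znodes to process, ignoring the lock znode. */
    static List<String> clustersToProcess(List<String> children) {
        // With at most one child, the listing holds only the lock (or nothing),
        // so there are no queues to process. Hence "<= 1" rather than "== 1".
        if (children == null || children.size() <= 1) {
            return Collections.emptyList();
        }
        List<String> clusters = new ArrayList<>();
        for (String child : children) {
            // The lock is not a cluster, so it is skipped.
            if (!LOCK_ZNODE.equals(child)) {
                clusters.add(child);
            }
        }
        return clusters;
    }
}
```

With children ("1", "lock") this yields only "1"; with just ("lock") it yields an empty list.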

bq.  On 2010-06-11 12:45:29, stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeperHelper.java, line 356
bq.  > <http://review.hbase.org/r/76/diff/5/?file=1113#file1113line356>
bq.  >
bq.  >     What's this about?

See previous comment: we lock the dead region server's znode by putting a lock 
in there, but we don't want to process the hlogs under it since... it's not a 
cluster. Could use more doc.

bq.  On 2010-06-11 12:45:29, stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeperHelper.java, line 402
bq.  > <http://review.hbase.org/r/76/diff/5/?file=1113#file1113line402>
bq.  >
bq.  >     Just logging errors?  What if session expired (our discussion from last day)?

Yes, I need to review how I handle it in RZH, but I'd also need to review ZKW 
since some methods will hide it in there.

bq.  On 2010-06-11 12:45:29, stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/replication/package.html, line 41
bq.  > <http://review.hbase.org/r/76/diff/5/?file=1115#file1115line41>
bq.  >
bq.  >     Call it alpha

yeah! (j/k)

bq.  On 2010-06-11 12:45:29, stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/replication/package.html, line 64
bq.  > <http://review.hbase.org/r/76/diff/5/?file=1115#file1115line64>
bq.  >
bq.  >     What's this about?  You need to run zk yourself but no zoo.cfg?

I... don't remember why I wrote this.

bq.  On 2010-06-11 12:45:29, stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/replication/package.html, line 73
bq.  > <http://review.hbase.org/r/76/diff/5/?file=1115#file1115line73>
bq.  >
bq.  >     And if not?  What if replicating single-family only?

I forgot to update that after we added scoping. Updating it now.

bq.  On 2010-06-11 12:45:29, stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/replication/package.html, line 83
bq.  > <http://review.hbase.org/r/76/diff/5/?file=1115#file1115line83>
bq.  >
bq.  >     Has to be offline?  Will this always be the case?

Currently everything is static, but I hope we can move on from that in the 
future.

bq.  On 2010-06-11 12:45:29, stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/replication/package.html, line 108
bq.  > <http://review.hbase.org/r/76/diff/5/?file=1115#file1115line108>
bq.  >
bq.  >     What's ratio?

This is a log snippet that's coming from a region server. Do you want to see 
more documentation about it in package.html or in the logging itself?

- Jean-Daniel

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.hbase.org/r/76/#review191
-----------------------------------------------------------

> Handle 10min+ network partitions between clusters
> -------------------------------------------------
>
>                 Key: HBASE-2223
>                 URL: https://issues.apache.org/jira/browse/HBASE-2223
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>             Fix For: 0.21.0
>
>         Attachments: HBASE-2223.patch
>
>
> We need a nice way of handling long network partitions without impacting a 
> master cluster (which pushes the data). Currently it will just retry over and 
> over again.
> I think we could:
>  - Stop replication to a slave cluster if it didn't respond for more than 10 
> minutes
>  - Keep track of the duration of the partition
>  - When the slave cluster comes back, initiate a MR job like HBASE-2221 
> Maybe we want less than 10 minutes, maybe we want this to be all automatic or 
> just the first 2 parts. Discuss.
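
One possible shape for the first two bullets of the description above, as a minimal sketch only. PartitionTracker and its methods are hypothetical names and not part of any patch on this issue; timestamps are passed in explicitly so the logic is easy to test, and the 10-minute threshold is just a constructor argument.

```java
/** Hypothetical sketch: tracks how long a slave cluster has been unreachable. */
final class PartitionTracker {
    private final long thresholdMillis;
    // Time of the first failed attempt in the current outage, or -1 if none.
    private long firstFailureMillis = -1;

    PartitionTracker(long thresholdMillis) {
        this.thresholdMillis = thresholdMillis;
    }

    /**
     * Records a failed replication attempt.
     * Returns true once the outage has lasted at least the threshold,
     * meaning replication to this slave should stop instead of retrying.
     */
    boolean onFailure(long nowMillis) {
        if (firstFailureMillis < 0) {
            firstFailureMillis = nowMillis;
        }
        return nowMillis - firstFailureMillis >= thresholdMillis;
    }

    /**
     * Records a successful contact with the slave cluster.
     * Returns the duration of the outage that just ended (0 if there was none),
     * which a caller could use to decide whether a catch-up job is needed.
     */
    long onSuccess(long nowMillis) {
        long duration = firstFailureMillis < 0 ? 0 : nowMillis - firstFailureMillis;
        firstFailureMillis = -1;
        return duration;
    }
}
```

A caller would construct it with 10 * 60 * 1000L, call onFailure on every failed push, and hand the value returned by onSuccess to whatever starts the catch-up job.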

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
