[ https://issues.apache.org/jira/browse/HBASE-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445239#comment-13445239 ]
Chris Trezzo commented on HBASE-6695:
-------------------------------------
Shouldn't the following line:
{code}
if (ZKUtil.checkExists(zkHelper.getZookeeperWatcher(), rsZnode) != -1) {
{code}
be this instead:
{code}
if (ZKUtil.checkExists(zkHelper.getZookeeperWatcher(), rsZnode) == -1) {
{code}
If the node is gone, then we can stop trying to fail it over. If it still
exists, we need to keep trying to fail it over.
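To make the condition concrete, here is a minimal, self-contained sketch. It assumes the documented contract of ZKUtil.checkExists (the znode's version if it exists, -1 if it does not) and replaces ZooKeeper with an in-memory map. The class name, the map, and the failoverFinished helper are hypothetical and are not part of the patch.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of the retry condition only; ZooKeeper is replaced by an in-memory map. */
class FailoverRetrySketch {
    // Hypothetical stand-in for ZooKeeper: znode path -> znode version.
    static final Map<String, Integer> znodes = new HashMap<>();

    // Mimics the assumed contract of ZKUtil.checkExists:
    // returns the znode's version if it exists, -1 otherwise.
    static int checkExists(String znode) {
        return znodes.getOrDefault(znode, -1);
    }

    // True when the dead region server's znode is gone, so retrying can stop.
    static boolean failoverFinished(String rsZnode) {
        return checkExists(rsZnode) == -1;
    }

    public static void main(String[] args) {
        // Hypothetical znode path, for illustration only.
        String rsZnode = "/hbase/replication/rs/deadserver,60020,1";

        znodes.put(rsZnode, 0);
        // The znode still exists, so failover must keep trying.
        System.out.println("finished while znode exists: " + failoverFinished(rsZnode));

        znodes.remove(rsZnode);
        // The znode is gone, so failover can stop.
        System.out.println("finished after znode removed: " + failoverFinished(rsZnode));
    }
}
```

With the `!= -1` comparison, the branch would instead be taken while the znode still exists, which is the opposite of the behavior described above.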
I suggest marking this as a duplicate and continuing the conversation in
HBASE-2611, since both issues address the exact same problem.
> [Replication] Data will lose if RegionServer down during transferqueue
> ----------------------------------------------------------------------
>
> Key: HBASE-6695
> URL: https://issues.apache.org/jira/browse/HBASE-6695
> Project: HBase
> Issue Type: Bug
> Components: replication
> Affects Versions: 0.94.1
> Reporter: terry zhang
> Priority: Critical
> Fix For: 0.96.0, 0.94.3
>
> Attachments: HBASE-6695.patch
>
>
> When we were testing the replication failover feature, we found that if we
> kill a regionserver during transferqueue, only part of the hlog znodes are
> copied to the right path because the failover process is interrupted.
> Log:
> 2012-08-29 12:20:05,660 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager:
> Moving dw92.kgb.sqa.cm4,60020,1346210789716's hlogs to my queue
> 2012-08-29 12:20:05,765 DEBUG
> org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating
> dw92.kgb.sqa.cm4%2C60020%2C1346210789716.1346213720708 with data 210508162
> 2012-08-29 12:20:05,850 DEBUG
> org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating
> dw92.kgb.sqa.cm4%2C60020%2C1346210789716.1346213886800 with data
> 2012-08-29 12:20:05,938 DEBUG
> org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating
> dw92.kgb.sqa.cm4%2C60020%2C1346210789716.1346213830559 with data
> 2012-08-29 12:20:06,055 DEBUG
> org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating
> dw92.kgb.sqa.cm4%2C60020%2C1346210789716.1346213775146 with data
> 2012-08-29 12:20:06,277 WARN
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
> Failed all from region=.META.,,1.1028785192, hostname=dw93.kgb.sqa.cm4, port=60020
> java.util.concurrent.ExecutionException: java.net.ConnectException:
> Connection refused
> at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
> at java.util.concurrent.FutureTask.get(FutureTask.java:83)
> at
> ......
> This server is down .....
> ZK node status:
> [zk: 10.232.98.77:2181(CONNECTED) 6] ls
> /hbase-test3-repl/replication/rs/dw92.kgb.sqa.cm4,60020,1346210789716
> [lock, 1, 1-dw89.kgb.sqa.cm4,60020,1346202436268]
>
> dw92 is down, but the node dw92.kgb.sqa.cm4,60020,1346210789716 can't be deleted.