[ 
https://issues.apache.org/jira/browse/HBASE-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445160#comment-13445160
 ] 

Jean-Daniel Cryans commented on HBASE-6695:
-------------------------------------------

Some comments on the patch:

 - In the worst case (the namenode died and the whole cluster is restarted), each 
region server will have as many NodeFailoverWorker threads spinning as there are 
region servers while the queues are moved. The compounding effect of doing all 
the writes and checks could be substantial.
 - Moving is still not atomic enough: you can still have 1 HLog replayed by 2 
region servers, and I'm not sure what's going to happen there.
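
To make the atomicity point concrete, here is a minimal in-memory sketch of an
all-or-nothing queue move. It is plain Java, not the ZooKeeper client API and
not the code in the attached patch; the class name, method names and paths are
hypothetical. ZooKeeper 3.4 added a multi() call with comparable all-or-nothing
semantics, but whether it can be used here is an assumption, not something the
patch does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory stand-in for a znode tree (hypothetical, for illustration only). */
class QueueMoveSketch {
    private final Map<String, String> znodes = new TreeMap<>();

    synchronized void put(String path, String data) {
        znodes.put(path, data);
    }

    synchronized boolean exists(String path) {
        return znodes.containsKey(path);
    }

    /**
     * Moves every hlog entry under deadQueue to newQueue in a single step.
     * Returns false and changes nothing if the dead queue has no entries
     * (for example, another server already claimed it) or if any target
     * path already exists.
     */
    synchronized boolean moveQueueAtomically(String deadQueue, String newQueue) {
        String prefix = deadQueue + "/";
        Map<String, String> staged = new TreeMap<>();
        List<String> toDelete = new ArrayList<>();

        // Phase 1: stage all creates and deletes without touching the tree.
        for (Map.Entry<String, String> e : znodes.entrySet()) {
            if (e.getKey().startsWith(prefix)) {
                String target = newQueue + "/" + e.getKey().substring(prefix.length());
                if (znodes.containsKey(target)) {
                    return false; // precondition failed, nothing applied
                }
                staged.put(target, e.getValue());
                toDelete.add(e.getKey());
            }
        }
        if (toDelete.isEmpty()) {
            return false; // nothing left to claim
        }

        // Phase 2: apply everything while still holding the lock.
        for (String path : toDelete) {
            znodes.remove(path);
        }
        znodes.putAll(staged);
        return true;
    }
}
```

With a move shaped like this, a second server that tries to claim the same
dead queue gets false back and copies nothing, so no HLog ends up in two
queues, and a half-copied queue is never visible.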
                
> [Replication] Data will lose if RegionServer down during transferqueue
> ----------------------------------------------------------------------
>
>                 Key: HBASE-6695
>                 URL: https://issues.apache.org/jira/browse/HBASE-6695
>             Project: HBase
>          Issue Type: Bug
>          Components: replication
>    Affects Versions: 0.94.1
>            Reporter: terry zhang
>            Priority: Critical
>             Fix For: 0.96.0, 0.94.2
>
>         Attachments: HBASE-6695.patch
>
>
> When we were testing the Replication failover feature, we found that if we 
> kill a regionserver during its transferqueue, only part of the hlog znodes 
> are copied to the right path because the failover process is interrupted. 
> Log:
> 2012-08-29 12:20:05,660 INFO 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: 
> Moving dw92.kgb.sqa.cm4,60020,1346210789716's hlogs to my queue
> 2012-08-29 12:20:05,765 DEBUG 
> org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating 
> dw92.kgb.sqa.cm4%2C60020%2C1346210789716.1346213720708 with data 210508162
> 2012-08-29 12:20:05,850 DEBUG 
> org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating 
> dw92.kgb.sqa.cm4%2C60020%2C1346210789716.1346213886800 with data
> 2012-08-29 12:20:05,938 DEBUG 
> org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating 
> dw92.kgb.sqa.cm4%2C60020%2C1346210789716.1346213830559 with data
> 2012-08-29 12:20:06,055 DEBUG 
> org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating 
> dw92.kgb.sqa.cm4%2C60020%2C1346210789716.1346213775146 with data
> 2012-08-29 12:20:06,277 WARN 
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: 
> Failed all from region=.META.,,1.1028785192, hostname=dw93.kgb.sqa.cm4, port=60020
> java.util.concurrent.ExecutionException: java.net.ConnectException: 
> Connection refused
> at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
> at java.util.concurrent.FutureTask.get(FutureTask.java:83)
> at 
> ......
> This server is down .....
> ZK node status:
> [zk: 10.232.98.77:2181(CONNECTED) 6] ls 
> /hbase-test3-repl/replication/rs/dw92.kgb.sqa.cm4,60020,1346210789716
> [lock, 1, 1-dw89.kgb.sqa.cm4,60020,1346202436268]
>  
> dw92 is down, but the node dw92.kgb.sqa.cm4,60020,1346210789716 can't be deleted

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
