[ 
https://issues.apache.org/jira/browse/HBASE-10482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13895166#comment-13895166
 ] 

Demai Ni commented on HBASE-10482:
----------------------------------

+1 from me. 
[~jdcryans], thank you very much. 

> ReplicationSyncUp doesn't clean up its ZK, needed for tests
> -----------------------------------------------------------
>
>                 Key: HBASE-10482
>                 URL: https://issues.apache.org/jira/browse/HBASE-10482
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 0.96.1, 0.94.16
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>             Fix For: 0.98.1, 0.99.0, 0.94.17
>
>         Attachments: HBASE-10249.patch
>
>
> TestReplicationSyncUpTool failed again:
> https://builds.apache.org/job/HBase-TRUNK/4895/testReport/junit/org.apache.hadoop.hbase.replication/TestReplicationSyncUpTool/testSyncUpTool/
> It's not obvious why only one of the two tables is replicated, and the test 
> could use some more logging, but I understand it this way:
> The first ReplicationSyncUp gets started and for some reason it cannot 
> replicate the data:
> {noformat}
> 2014-02-06 21:32:19,811 INFO  [Thread-1372] 
> regionserver.ReplicationSourceManager(203): Current list of replicators: 
> [1391722339091.SyncUpTool.replication.org,1234,1, 
> quirinus.apache.org,37045,1391722237951, 
> quirinus.apache.org,33502,1391722238125] other RSs: []
> 2014-02-06 21:32:19,811 INFO  [Thread-1372.replicationSource,1] 
> regionserver.ReplicationSource(231): Replicating 
> db42e7fc-7f29-4038-9292-d85ea8b9994b -> 783c0ab2-4ff9-4dc0-bb38-86bf31d1d817
> 2014-02-06 21:32:19,892 TRACE [Thread-1372.replicationSource,2] 
> regionserver.ReplicationSource(596): No log to process, sleeping 100 times 1
> 2014-02-06 21:32:19,911 TRACE [Thread-1372.replicationSource,1] 
> regionserver.ReplicationSource(596): No log to process, sleeping 100 times 1
> 2014-02-06 21:32:20,094 TRACE [Thread-1372.replicationSource,2] 
> regionserver.ReplicationSource(596): No log to process, sleeping 100 times 2
> ...
> 2014-02-06 21:32:23,414 TRACE [Thread-1372.replicationSource,1] 
> regionserver.ReplicationSource(596): No log to process, sleeping 100 times 8
> 2014-02-06 21:32:23,673 INFO  [ReplicationExecutor-0] 
> replication.ReplicationQueuesZKImpl(169): Moving 
> quirinus.apache.org,37045,1391722237951's hlogs to my queue
> 2014-02-06 21:32:23,768 DEBUG [ReplicationExecutor-0] 
> replication.ReplicationQueuesZKImpl(396): Creating 
> quirinus.apache.org%2C37045%2C1391722237951.1391722243779 with data 10803
> 2014-02-06 21:32:23,842 DEBUG [ReplicationExecutor-0] 
> replication.ReplicationQueuesZKImpl(396): Creating 
> quirinus.apache.org%2C37045%2C1391722237951.1391722243779 with data 10803
> 2014-02-06 21:32:24,297 TRACE [Thread-1372.replicationSource,2] 
> regionserver.ReplicationSource(596): No log to process, sleeping 100 times 9
> 2014-02-06 21:32:24,314 TRACE [Thread-1372.replicationSource,1] 
> regionserver.ReplicationSource(596): No log to process, sleeping 100 times 9
> {noformat}
> Finally it gives up:
> {noformat}
> 2014-02-06 21:32:30,873 DEBUG [Thread-1372] 
> replication.TestReplicationSyncUpTool(323): SyncUpAfterDelete failed at retry 
> = 0, with rowCount_ht1TargetPeer1 =100 and rowCount_ht2TargetAtPeer1 =200
> {noformat}
> The syncUp tool has an ID you can follow: grep for 
> syncupReplication1391722338885, or just the timestamp, and you can see it 
> still doing things after that. The reason is that the tool closes the 
> ReplicationSourceManager but not the ZK connection, so events _still_ come in 
> and NodeFailoverWorker _still_ tries to recover queues, but then there's 
> nothing left to process them.
> Later in the logs you can see:
> {noformat}
> 2014-02-06 21:32:37,381 INFO  [ReplicationExecutor-0] 
> replication.ReplicationQueuesZKImpl(169): Moving 
> quirinus.apache.org,33502,1391722238125's hlogs to my queue
> 2014-02-06 21:32:37,567 INFO  [ReplicationExecutor-0] 
> replication.ReplicationQueuesZKImpl(239): Won't transfer the queue, another 
> RS took care of it because of: KeeperErrorCode = NoNode for 
> /1/replication/rs/quirinus.apache.org,33502,1391722238125/lock
> {noformat}
> There shouldn't be any racing, but by now someone has already moved 
> "quirinus.apache.org,33502,1391722238125" away.
> FWIW I can't even make the test fail on my machine so I'm not 100% sure 
> closing the ZK connection fixes the issue, but at least it's the right thing 
> to do.
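
The failure mode described above (the manager is closed but the ZK connection keeps delivering events) can be illustrated with a small, self-contained sketch. This is not HBase code: EventSource, QueueManager, and orphaned are hypothetical stand-ins for the ZK connection, ReplicationSourceManager, and the claimed-but-never-replicated queues. It only assumes that a closed manager still receives callbacks until the event source itself is closed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class SyncUpCleanupSketch {

    /** Hypothetical stand-in for a ZK connection that pushes events to listeners. */
    static class EventSource implements AutoCloseable {
        private final List<Consumer<String>> listeners = new ArrayList<>();
        private boolean open = true;

        void register(Consumer<String> listener) {
            listeners.add(listener);
        }

        /** Delivers the event to every listener, but only while the connection is open. */
        void fire(String deadServer) {
            if (!open) {
                return;
            }
            for (Consumer<String> listener : listeners) {
                listener.accept(deadServer);
            }
        }

        @Override
        public void close() {
            open = false;
        }
    }

    /** Hypothetical stand-in for the source manager plus its failover worker. */
    static class QueueManager implements AutoCloseable {
        private boolean running = true;
        final List<String> claimed = new ArrayList<>();
        final List<String> processed = new ArrayList<>();

        QueueManager(EventSource source) {
            source.register(this::onServerGone);
        }

        void onServerGone(String deadServer) {
            claimed.add(deadServer);       // the queue is still moved to "my queue"
            if (running) {
                processed.add(deadServer); // but it is only replicated while running
            }
        }

        @Override
        public void close() {
            running = false;
        }
    }

    /** Queues that were claimed but never processed. */
    static List<String> orphaned(QueueManager manager) {
        List<String> result = new ArrayList<>(manager.claimed);
        result.removeAll(manager.processed);
        return result;
    }

    public static void main(String[] args) {
        // Only the manager is closed: events keep arriving and a queue is orphaned.
        EventSource zk1 = new EventSource();
        QueueManager m1 = new QueueManager(zk1);
        m1.close();
        zk1.fire("rs-a");
        System.out.println("manager closed only: " + orphaned(m1));

        // Manager and event source are both closed: nothing is claimed afterwards.
        EventSource zk2 = new EventSource();
        QueueManager m2 = new QueueManager(zk2);
        m2.close();
        zk2.close();
        zk2.fire("rs-a");
        System.out.println("both closed: " + orphaned(m2));
    }
}
```

In the first case the sketch reports one orphaned queue; in the second it reports none. That matches the reasoning in the description: closing the event source along with the manager stops a finished tool from claiming queues it will never drain.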



