Jean-Daniel Cryans created HBASE-10482:
------------------------------------------

             Summary: ReplicationSyncUp doesn't clean up its ZK, needed for tests
                 Key: HBASE-10482
                 URL: https://issues.apache.org/jira/browse/HBASE-10482
             Project: HBase
          Issue Type: Bug
          Components: Replication
    Affects Versions: 0.94.16, 0.96.1
            Reporter: Jean-Daniel Cryans
            Assignee: Jean-Daniel Cryans
             Fix For: 0.96.2, 0.98.1, 0.99.0, 0.94.17


TestReplicationSyncUpTool failed again:
https://builds.apache.org/job/HBase-TRUNK/4895/testReport/junit/org.apache.hadoop.hbase.replication/TestReplicationSyncUpTool/testSyncUpTool/

It's not super obvious why only one of the two tables is replicated, and the 
test could use some more logging, but I understand it this way:

The first ReplicationSyncUp gets started and for some reason it cannot 
replicate the data:

{noformat}
2014-02-06 21:32:19,811 INFO  [Thread-1372] 
regionserver.ReplicationSourceManager(203): Current list of replicators: 
[1391722339091.SyncUpTool.replication.org,1234,1, 
quirinus.apache.org,37045,1391722237951, 
quirinus.apache.org,33502,1391722238125] other RSs: []
2014-02-06 21:32:19,811 INFO  [Thread-1372.replicationSource,1] 
regionserver.ReplicationSource(231): Replicating 
db42e7fc-7f29-4038-9292-d85ea8b9994b -> 783c0ab2-4ff9-4dc0-bb38-86bf31d1d817
2014-02-06 21:32:19,892 TRACE [Thread-1372.replicationSource,2] 
regionserver.ReplicationSource(596): No log to process, sleeping 100 times 1
2014-02-06 21:32:19,911 TRACE [Thread-1372.replicationSource,1] 
regionserver.ReplicationSource(596): No log to process, sleeping 100 times 1
2014-02-06 21:32:20,094 TRACE [Thread-1372.replicationSource,2] 
regionserver.ReplicationSource(596): No log to process, sleeping 100 times 2
...
2014-02-06 21:32:23,414 TRACE [Thread-1372.replicationSource,1] 
regionserver.ReplicationSource(596): No log to process, sleeping 100 times 8
2014-02-06 21:32:23,673 INFO  [ReplicationExecutor-0] 
replication.ReplicationQueuesZKImpl(169): Moving 
quirinus.apache.org,37045,1391722237951's hlogs to my queue
2014-02-06 21:32:23,768 DEBUG [ReplicationExecutor-0] 
replication.ReplicationQueuesZKImpl(396): Creating 
quirinus.apache.org%2C37045%2C1391722237951.1391722243779 with data 10803
2014-02-06 21:32:23,842 DEBUG [ReplicationExecutor-0] 
replication.ReplicationQueuesZKImpl(396): Creating 
quirinus.apache.org%2C37045%2C1391722237951.1391722243779 with data 10803
2014-02-06 21:32:24,297 TRACE [Thread-1372.replicationSource,2] 
regionserver.ReplicationSource(596): No log to process, sleeping 100 times 9
2014-02-06 21:32:24,314 TRACE [Thread-1372.replicationSource,1] 
regionserver.ReplicationSource(596): No log to process, sleeping 100 times 9
{noformat}

Finally it gives up:

{noformat}
2014-02-06 21:32:30,873 DEBUG [Thread-1372] 
replication.TestReplicationSyncUpTool(323): SyncUpAfterDelete failed at retry = 
0, with rowCount_ht1TargetPeer1 =100 and rowCount_ht2TargetAtPeer1 =200
{noformat}

The syncUp tool has an ID you can follow, grep for 
syncupReplication1391722338885 or just the timestamp, and you can see it doing 
things after that. The reason is that the tool closes the 
ReplicationSourceManager but not the ZK connection, so events _still_ come in 
and NodeFailoverWorker _still_ tries to recover queues but then there's nothing 
to process them.
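
The sequence described above can be sketched with a small self-contained model. 
Everything in it is a hypothetical stand-in written for illustration 
(FakeZkConnection, FakeSourceManager, run, and the server name "rs-b,33502"); 
none of it is HBase or ZooKeeper API, and it only assumes the behavior 
described in this report: a stopped manager whose event source is still open 
keeps claiming queues that nothing will process.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class SyncUpShutdownSketch {

    /** Hypothetical stand-in for a ZK connection that delivers events to listeners. */
    static class FakeZkConnection implements AutoCloseable {
        private final List<Consumer<String>> listeners = new ArrayList<>();
        private boolean closed = false;

        void register(Consumer<String> listener) {
            listeners.add(listener);
        }

        /** Simulates ZK reporting that a region server's znode went away. */
        void fireNodeDeleted(String serverName) {
            if (closed) {
                return; // a closed connection delivers nothing
            }
            for (Consumer<String> listener : listeners) {
                listener.accept(serverName);
            }
        }

        @Override
        public void close() {
            closed = true;
            listeners.clear();
        }
    }

    /** Hypothetical stand-in for the source manager that claims dead servers' queues. */
    static class FakeSourceManager {
        final List<String> claimedQueues = new ArrayList<>();
        final List<String> orphanedQueues = new ArrayList<>();
        private boolean running = true;

        void onServerDead(String serverName) {
            if (running) {
                claimedQueues.add(serverName);
            } else {
                // The queue is still moved, but no source is left to process it.
                orphanedQueues.add(serverName);
            }
        }

        void stop() {
            running = false;
        }
    }

    /** Stops the manager, optionally closes the connection, then fires a late event. */
    static List<String> run(boolean closeZk) {
        FakeZkConnection zk = new FakeZkConnection();
        FakeSourceManager manager = new FakeSourceManager();
        zk.register(manager::onServerDead);

        manager.stop();
        if (closeZk) {
            zk.close();
        }
        zk.fireNodeDeleted("rs-b,33502");
        return manager.orphanedQueues;
    }

    public static void main(String[] args) {
        System.out.println("without close: " + run(false));
        System.out.println("with close:    " + run(true));
    }
}
```

In this model, run(false) leaves one orphaned queue while run(true) leaves 
none, which is the difference closing the connection is expected to make.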

Later in the logs you can see:

{noformat}
2014-02-06 21:32:37,381 INFO  [ReplicationExecutor-0] 
replication.ReplicationQueuesZKImpl(169): Moving 
quirinus.apache.org,33502,1391722238125's hlogs to my queue
2014-02-06 21:32:37,567 INFO  [ReplicationExecutor-0] 
replication.ReplicationQueuesZKImpl(239): Won't transfer the queue, another RS 
took care of it because of: KeeperErrorCode = NoNode for 
/1/replication/rs/quirinus.apache.org,33502,1391722238125/lock
{noformat}

There shouldn't be any racing, but by now someone has already moved 
"quirinus.apache.org,33502,1391722238125" away.

FWIW I can't even make the test fail on my machine so I'm not 100% sure closing 
the ZK connection fixes the issue, but at least it's the right thing to do.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
