bharathv commented on a change in pull request #3204:
URL: https://github.com/apache/hbase/pull/3204#discussion_r621850651



##########
File path: 
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceManager.java
##########
@@ -641,8 +641,8 @@ public void removePeer(String id) {
             "and that replication on this peer may not be caught up. peerId=" 
+ id);
       }
       for (ReplicationSourceInterface toRemove : srcToRemove) {
-        toRemove.terminate(terminateMessage);
         closeQueue(toRemove);
+        toRemove.terminate(terminateMessage);

Review comment:
       Spoke with @sandeepvinayak. So the issue here was introduced by 
HBASE-25583. That patch introduced a peerRemoved() cleanup call from inside of 
ReplicationSourceShipperThread when any ZK call to update log position throws 
NoNodeException, meaning the peer was removed by some other process.
   
   (Ideally the peer should eventually get cleaned up due to a notification 
from ZK but we have repeatedly observed that there are occasional missed ZK 
notifications resulting in a weird ZK state, so the patch aggressively cleans 
up the peer state the first time it sees this exception).
   
   Now the problem is that peerRemoved() attempts to interrupt and cleanup the 
ReplicationSourceShipperThread from which it is being called and that is 
aborting the thread half way through and not performing the cleanup. Ideally 
that shouldn't happen but JVM has some weird semantics about thread 
interrupting itself, sigh.
   
   So the right way of fixing it is to keep the existing cleanup order as is 
(which is correct) but instead _not_ call peer cleanup routine from inside 
shipper/other replication threads. Currently the way the code is structured, 
there is no easy way (I can think of) to schedule a peer remove event without 
significant plumbing. A brute force way is to schedule it in the top level 
executor in ReplicationSourceManager, is there a better way?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to