Ashu Pachauri created HBASE-14621:
-------------------------------------

             Summary: ReplicationLogCleaner gets stuck when a regionserver crashes
                 Key: HBASE-14621
                 URL: https://issues.apache.org/jira/browse/HBASE-14621
             Project: HBase
          Issue Type: Bug
          Components: Replication
            Reporter: Ashu Pachauri
            Assignee: Ashu Pachauri
            Priority: Critical


The ReplicationLogCleaner can get stuck in an infinite loop when a regionserver
crashes. The bug was introduced by the fix for HBASE-12865, which makes the
loadWALsFromQueues method retry whenever the replication node's cversion
changes while it is loading the replication queues for the regionservers.
However, when that scenario actually occurs (a regionserver crashes in the
middle of the operation), the method never exits the retry loop.
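
To make the retry pattern concrete, here is a minimal, self-contained sketch of
a "read, then re-check the version" loop. It is not the HBase code: QueueStore,
FakeStore, loadWals and the maxAttempts bound are all hypothetical names
introduced for illustration, and the bound is there only so that the sketch
terminates, not as a description of the actual fix.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.function.IntSupplier;

class CversionRetrySketch {

    /** Hypothetical stand-in for the replication queue storage; not an HBase interface. */
    interface QueueStore {
        int cversion();                      // child version of the replication node
        Map<String, List<String>> queues();  // regionserver -> WAL names in its queues
    }

    /** In-memory fake used only for the demonstration below. */
    static final class FakeStore implements QueueStore {
        private final IntSupplier versions;
        private final Map<String, List<String>> queues;

        FakeStore(IntSupplier versions, Map<String, List<String>> queues) {
            this.versions = versions;
            this.queues = queues;
        }

        @Override public int cversion() { return versions.getAsInt(); }
        @Override public Map<String, List<String>> queues() { return queues; }
    }

    /**
     * Collects all queued WAL names, retrying when the cversion read before the
     * scan differs from the one read after it. Returns Optional.empty() if no
     * consistent snapshot was obtained within maxAttempts (an assumed bound).
     */
    static Optional<Set<String>> loadWals(QueueStore store, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            int before = store.cversion();
            Set<String> wals = new HashSet<>();
            for (List<String> perServer : store.queues().values()) {
                wals.addAll(perServer);
            }
            int after = store.cversion();
            if (before == after) {
                return Optional.of(wals);    // nothing changed during the scan
            }
            // cversion moved during the scan: discard the result and try again
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Map<String, List<String>> queues =
            Map.of("rs1", List.of("wal-1", "wal-2"), "rs2", List.of("wal-3"));

        // cversion changes once (as if a regionserver node went away), then settles.
        int[] calls = {0};
        QueueStore settles = new FakeStore(() -> calls[0]++ == 0 ? 5 : 6, queues);
        System.out.println("settles: " + loadWals(settles, 10).isPresent());

        // The two reads never agree, so no snapshot is ever accepted.
        int[] tick = {0};
        QueueStore neverSettles = new FakeStore(() -> tick[0]++, queues);
        System.out.println("never settles: " + loadWals(neverSettles, 10).isPresent());
    }
}
```

Without a bound like maxAttempts, a loop of this shape has only one exit: the
two version reads agreeing. If they never agree, it spins forever, which is the
kind of hang reported here.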

The ramifications are serious: old WALs are never cleaned up, and in a
high-load environment the file count in the oldWALs directory soon exceeds the
inode limit and the cluster goes down.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
