Ashu Pachauri created HBASE-14621:
-------------------------------------
Summary: ReplicationLogCleaner gets stuck when a regionserver crashes
Key: HBASE-14621
URL: https://issues.apache.org/jira/browse/HBASE-14621
Project: HBase
Issue Type: Bug
Components: Replication
Reporter: Ashu Pachauri
Assignee: Ashu Pachauri
Priority: Critical
The ReplicationLogCleaner has a bug that makes it get stuck in an infinite loop
when a regionserver crashes. The bug was introduced by the fix for
HBASE-12865, which makes the loadWALsFromQueues method retry whenever the
replication node's cversion changes while it is loading the replication queues
for the regionservers. However, if that scenario actually occurs (a
regionserver crashes in the middle of the operation), the method never exits
its retry loop.
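The report does not spell out why the retry never terminates. As a minimal
sketch of one way such a version-checked retry can spin forever, the Java below
assumes the baseline cversion is read once, before the loop, and never
refreshed. This is not the HBase code: the class, the method names, the
IntSupplier/Supplier stand-ins for the ZooKeeper reads, and the attempt cap
(present only so the demonstration terminates) are all hypothetical.

```java
import java.util.List;
import java.util.function.IntSupplier;
import java.util.function.Supplier;

// Hypothetical illustration only; none of these names come from HBase.
final class CversionRetrySketch {

    // Assumed buggy shape: the baseline version is read once, outside the loop.
    // After a single change, v0 can never equal v1 again.
    static <T> T loadWithStaleBaseline(IntSupplier cversion, Supplier<T> loader, int maxAttempts) {
        int v0 = cversion.getAsInt();
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            T snapshot = loader.get();
            int v1 = cversion.getAsInt();
            if (v0 == v1) {
                return snapshot; // nothing changed while loading
            }
        }
        return null; // cap reached; without the cap this would loop forever
    }

    // Assumed corrected shape: the baseline is re-read at the start of each attempt.
    static <T> T loadWithFreshBaseline(IntSupplier cversion, Supplier<T> loader, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            int v0 = cversion.getAsInt();
            T snapshot = loader.get();
            int v1 = cversion.getAsInt();
            if (v0 == v1) {
                return snapshot;
            }
        }
        return null;
    }

    // Simulates a crash that bumps the version exactly once, during the first load.
    static List<String> demo(boolean refreshBaseline) {
        int[] version = {7};
        int[] loads = {0};
        Supplier<List<String>> loader = () -> {
            if (loads[0]++ == 0) {
                version[0]++; // child list changed mid-load
            }
            return List.of("wal-1", "wal-2");
        };
        IntSupplier cversion = () -> version[0];
        return refreshBaseline
                ? loadWithFreshBaseline(cversion, loader, 5)
                : loadWithStaleBaseline(cversion, loader, 5);
    }

    public static void main(String[] args) {
        System.out.println("stale baseline: " + demo(false));
        System.out.println("fresh baseline: " + demo(true));
    }
}
```

In this simulation the stale-baseline variant exhausts its attempts after a
single version change, while the fresh-baseline variant succeeds on its second
attempt.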
The consequences are severe: while the cleaner is stuck, old WALs are never
cleaned up, and in a high-load environment the file count in the oldWALs
directory soon exceeds the inode limit and the cluster goes down.