Wellington Chevreuil created HBASE-28884:
--------------------------------------------
             Summary: SFT's BrokenStoreFileCleaner may cause data loss
                 Key: HBASE-28884
                 URL: https://issues.apache.org/jira/browse/HBASE-28884
             Project: HBase
          Issue Type: Bug
            Reporter: Wellington Chevreuil
            Assignee: Wellington Chevreuil

With BrokenStoreFileCleaner enabled, one of our customers ran into a data loss situation, probably caused by a race between a region being moved off the region server and BrokenStoreFileCleaner checking that region's files for deletion eligibility. We have seen that the file was deleted by the given region server at around the same time the region was closed on that region server.

I believe a race condition during region close is possible here:

1) In BrokenStoreFileCleaner, for each region online on the given RS, we get the list of files in the store dirs, then iterate through it [1];

2) For each file listed, we perform several checks, including this one [2], which checks whether the file is "active".

The problem is that if the region owning the file being checked is closed between points 1 and 2, then by the time we check whether the file is active in [2], the store may already have been closed as part of the region closure, so the check would consider the file deletable.

One simple solution is to check that the store's region is still open before proceeding with deleting the file.

[1] https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/BrokenStoreFileCleaner.java#L99
[2] https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/BrokenStoreFileCleaner.java#L133

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
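To make the interleaving described in the report concrete, here is a minimal toy model of it. This is a sketch only, not the HBase code and not the actual patch: ToyRegion, deletableUnguarded and deletableGuarded are hypothetical names invented for illustration, and the "close" here simply drops the in-memory file references the way the report suggests a store close would. The example replays the race deterministically in a single thread (list files, close the region, then run the checks) rather than using real concurrency.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-ins for illustration; none of these are HBase classes.
final class CleanerRaceSketch {

  /** Toy region: tracks open/closed state and the files its store references. */
  static final class ToyRegion {
    private final Set<String> activeFiles = new HashSet<>();
    private boolean open = true;

    ToyRegion(List<String> referencedFiles) {
      activeFiles.addAll(referencedFiles);
    }

    boolean isOpen() {
      return open;
    }

    /** Assumption: closing the region also closes its store and drops file references. */
    void close() {
      open = false;
      activeFiles.clear();
    }

    boolean isActive(String file) {
      return activeFiles.contains(file);
    }
  }

  /** Unguarded check: any listed file that is not "active" is treated as deletable. */
  static List<String> deletableUnguarded(ToyRegion region, List<String> listedFiles) {
    List<String> deletable = new ArrayList<>();
    for (String file : listedFiles) {
      if (!region.isActive(file)) {
        deletable.add(file);
      }
    }
    return deletable;
  }

  /** Guarded check: additionally require the region to still be open before deleting. */
  static List<String> deletableGuarded(ToyRegion region, List<String> listedFiles) {
    List<String> deletable = new ArrayList<>();
    for (String file : listedFiles) {
      boolean inactive = !region.isActive(file);
      // Re-check the region state after the activity check, right before "deleting".
      if (inactive && region.isOpen()) {
        deletable.add(file);
      }
    }
    return deletable;
  }
}
```

In this model, if the file list is taken while the region is open and the region is then closed, deletableUnguarded reports every listed file as deletable, including the ones the store was serving, while deletableGuarded reports none. Placing the open check after the activity check is a choice made for this toy model; whether that ordering is sufficient in the real cleaner depends on how region close is synchronized there, which this sketch does not attempt to reproduce.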