Hi, I was doing performance testing of the CleanerChore with an S3 root directory on HBase 2.4. When the archive directory was large, the HFile cleaner threads were blocked on acquiring the SnapshotWriteLock. The flame graph [1] showed that SnapshotFileCache.getUnreferencedFiles performed a lot of listing operations against S3 while holding the lock; the lock was held for 45 seconds for a directory containing 1000 files. I've also seen one case where snapshot creation failed (timed out), which I assume may also be related to the long lock hold time.
The lock hold time can be drastically reduced by changing the Iterable<FileStatus> parameter of FileCleanerDelegate.getDeletableFiles to List<FileStatus>. With this change, the lock was released in approximately 100 ms, although the overall time for file listing, evaluation, and deletion stayed about the same (~45 s). Since the different cleaner threads no longer block on the snapshot lock, the overall deletion throughput can increase significantly. CleanerChore.checkAndDeleteFiles already receives a List<FileStatus> parameter and later converts it to Iterable<FileStatus>, so I don't expect a drastic change in the cleaners' memory usage.

I've also done some testing with HDFS:
- Iterable implementation: cleanup of 100k files took 63518 ms; the lock was held for ~130 ms per 1000 files.
- List implementation: cleanup of 100k files took 64411 ms; the lock was held for ~2 ms per 1000 files.

Do you have any concerns about making this change?

Thanks,
Peter

[1] https://issues.apache.org/jira/browse/HBASE-27590
[2] https://github.com/apache/hbase/pull/4995
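To illustrate why the signature change helps, here is a minimal, self-contained sketch (all names are hypothetical, not the actual HBase code): with an Iterable, the lazy evaluation of each file runs while the lock is held, whereas with a List the lock only needs to be held long enough to take a cheap snapshot of the referenced-file set, and the slow per-file evaluation happens after the lock is released.

```java
import java.util.*;
import java.util.concurrent.locks.ReentrantLock;

// Sketch only: contrasts lazy evaluation under a lock with snapshotting
// the shared state and evaluating outside the lock. Class and method
// names are illustrative, not the real FileCleanerDelegate API.
public class LockHoldSketch {
    static final ReentrantLock snapshotLock = new ReentrantLock();

    // Iterable variant: iterating the (possibly lazy) source and checking
    // each file happens entirely while the lock is held.
    static List<String> deletableLazy(Iterable<String> files) {
        List<String> out = new ArrayList<>();
        snapshotLock.lock();
        try {
            for (String f : files) {           // slow work under the lock
                if (!referencedSnapshotFiles().contains(f)) {
                    out.add(f);
                }
            }
        } finally {
            snapshotLock.unlock();
        }
        return out;
    }

    // List variant: hold the lock only to copy the referenced-file set,
    // then evaluate the already-materialized file list lock-free.
    static List<String> deletableEager(List<String> files) {
        Set<String> referenced;
        snapshotLock.lock();
        try {
            referenced = new HashSet<>(referencedSnapshotFiles()); // fast copy
        } finally {
            snapshotLock.unlock();
        }
        List<String> out = new ArrayList<>();
        for (String f : files) {               // slow work, lock released
            if (!referenced.contains(f)) {
                out.add(f);
            }
        }
        return out;
    }

    // Stand-in for the snapshot cache contents.
    static List<String> referencedSnapshotFiles() {
        return List.of("hfile-2");
    }

    public static void main(String[] args) {
        List<String> archive = List.of("hfile-1", "hfile-2", "hfile-3");
        System.out.println(deletableLazy(archive));   // [hfile-1, hfile-3]
        System.out.println(deletableEager(archive));  // [hfile-1, hfile-3]
    }
}
```

Both variants return the same deletable files, which matches the observation above that the total cleanup time is unchanged; only the lock hold time shrinks.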
