Hi,

I was doing performance testing of the CleanerChore with an S3 root
directory on HBase 2.4. When the archive directory was large, the HFile
cleaner threads were blocked on acquiring the SnapshotWriteLock. The flame
graph [1] showed a large number of S3 listing operations inside
SnapshotFileCache.getUnreferencedFiles; the lock was held for 45 seconds
with a directory containing 1,000 files.
I've also seen one case where snapshot creation failed (timed out), which I
assume is also related to the long lock hold time.

The lock hold time can be drastically reduced by changing the
Iterable<FileStatus> parameter of FileCleanerDelegate.getDeletableFiles to
List<FileStatus>. With this change, the lock was released in approximately
100 ms, although the overall time for file listing, evaluation, and
deletion stayed the same (45 s). Since the different cleaner threads are no
longer blocked on the snapshot lock, the overall deletion throughput can
increase significantly.
CleanerChore.checkAndDeleteFiles already receives a List<FileStatus>
parameter and later converts it to Iterable<FileStatus>, so I don't expect
a drastic change in the cleaners' memory usage.
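To make the effect of the signature change concrete, here is a minimal standalone sketch (hypothetical names, not HBase's actual code) of the two locking patterns: with a lazily consumed Iterable, the lock must stay held across the whole iteration, while a List lets the implementation take a cheap snapshot under the lock and run the slow per-file evaluation outside the critical section.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch only; class and method names are hypothetical.
public class LockScopeSketch {
    private static final ReentrantLock lock = new ReentrantLock();

    // Iterable-style: the caller consumes the iterable lazily, so any slow
    // per-file work (e.g. remote listings) runs entirely inside the lock.
    public static List<String> filterUnderLock(Iterable<String> files) {
        lock.lock();
        try {
            List<String> deletable = new ArrayList<>();
            for (String f : files) {   // slow evaluation happens here,
                deletable.add(f);      // with the lock still held
            }
            return deletable;
        } finally {
            lock.unlock();
        }
    }

    // List-style: copy the list under the lock (milliseconds), release it,
    // then do the slow evaluation lock-free so other threads can proceed.
    public static List<String> snapshotThenFilter(List<String> files) {
        List<String> copy;
        lock.lock();
        try {
            copy = new ArrayList<>(files); // cheap snapshot under the lock
        } finally {
            lock.unlock();
        }
        List<String> deletable = new ArrayList<>();
        for (String f : copy) {            // slow work runs without the lock
            deletable.add(f);
        }
        return deletable;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("hfile-1", "hfile-2", "hfile-3");
        System.out.println(filterUnderLock(files));
        System.out.println(snapshotThenFilter(files));
    }
}
```

Both variants return the same result; only the lock scope differs, which is why the end-to-end cleanup time stays the same while the hold time drops.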

I've done some testing with HDFS as well:
- Cleanup of 100k files took 63,518 ms with the Iterable implementation;
  the lock was held for ~130 ms per 1,000 files.
- Cleanup of 100k files took 64,411 ms with the List implementation; the
  lock was held for ~2 ms per 1,000 files.

Do you have any concerns about making this change?

Thanks,
Peter

[1] https://issues.apache.org/jira/browse/HBASE-27590
[2] https://github.com/apache/hbase/pull/4995
