ctubbsii opened a new issue, #3700: URL: https://github.com/apache/accumulo/issues/3700
**Is your feature request related to a problem? Please describe.**
The accumulo-gc currently scans to detect unused files and deletes them from the file system. However, this requires additional metadata storage to track the file deletion candidates, and requires a running process to delete these files after scanning the metadata table for current references. Scanning for current references can be problematic both for performance and for correctness: if an error causes a reference to go unseen, a file that is still in use could be treated as unused.

**Describe the solution you'd like**
Ideally, a compaction that results in orphaned files would trigger deletion of the old files right away, rather than leaving them behind for the accumulo-gc service to delete later. However, this won't work right now, because tablet splits and table clones can cause a file to be referenced by more than one tablet. If hard links were created for tablet splits and table clones, then each tablet would have its own unique reference without using more storage space, and could safely delete that reference. The underlying filesystem would be responsible for freeing the storage when the last hard link to a given file was removed. We would no longer need an accumulo-gc service at all.

**Describe alternatives you've considered**
Keep the accumulo-gc, or track file reference counts using conditional mutations, so that a file is deleted only when its reference count reaches 0.

**Additional context**
This proposed solution could trigger additional compaction IO after a merge, IO that becomes unnecessary once the no-chop merge feature is complete. Without no-chop merges, the chop compactions rewrite the files as smaller files, and the smaller files are then subject to being recombined in a subsequent compaction of the merged tablet. With the no-chop merges feature, however, the chop compactions can be skipped and the two smaller ranges can be recombined by updating only the metadata, without IO.
If the files appear under different names, so that each tablet has a unique file reference and can delete it whenever it wants to, then the ranges cannot be easily recombined: a merge could leave the two files subject to a subsequent compaction, rather than simply combining their ranges in the metadata. Using hard links could also make it more difficult to track storage use per table, and could cause data to be copied more often than necessary for table exports and imports.
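
The hard-link behavior the proposal relies on can be illustrated with a minimal sketch on a local POSIX-style filesystem. This is not Accumulo code: the function name, the `.rf` file names, and the "split" and "compaction" steps are hypothetical stand-ins, and whether the filesystem Accumulo runs on offers equivalent semantics is a separate question not addressed here.

```python
import os
import tempfile


def demo_hard_link_lifecycle(base_dir):
    """Show that file data survives until the last hard link is removed.

    All file names below are hypothetical and only mimic the scenario
    described in the issue (a tablet split followed by independent deletes).
    """
    # One data file, as if it belonged to a tablet before a split.
    parent_ref = os.path.join(base_dir, "tablet_parent.rf")
    with open(parent_ref, "wb") as f:
        f.write(b"key1=value1\n")

    # "Split": each child tablet gets its own hard link (its own name)
    # instead of sharing a single reference. No data is copied.
    child_a = os.path.join(base_dir, "tablet_a.rf")
    child_b = os.path.join(base_dir, "tablet_b.rf")
    os.link(parent_ref, child_a)
    os.link(parent_ref, child_b)
    os.remove(parent_ref)  # the parent tablet's name goes away

    links_after_split = os.stat(child_a).st_nlink

    # "Compaction" of tablet A: it deletes its own reference right away,
    # without checking whether any other tablet still uses the data.
    os.remove(child_a)
    with open(child_b, "rb") as f:
        data_seen_by_b = f.read()  # tablet B can still read the data
    links_after_delete = os.stat(child_b).st_nlink

    # Tablet B drops the last link; the filesystem frees the storage.
    os.remove(child_b)
    remaining = os.listdir(base_dir)

    return links_after_split, links_after_delete, data_seen_by_b, remaining


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(demo_hard_link_lifecycle(tmp))
```

In this sketch the filesystem's link count plays the role that a reference count tracked with conditional mutations would play in the alternative. The sketch also shows the trade-off raised above: the two child tablets end up with differently named files, so nothing in the names records that they share the same underlying data.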