ctubbsii opened a new issue, #3700:
URL: https://github.com/apache/accumulo/issues/3700

   **Is your feature request related to a problem? Please describe.**
   The accumulo-gc currently scans to detect unused files and deletes them from the file system. However, this requires additional metadata storage to track file deletion candidates, and it requires a dedicated process to be running that deletes these files after scanning the metadata table for current references. Scanning for current references can be problematic, both for performance and for correctness: an error that causes a current reference to go unseen could result in a file being deleted while it is still in use.
   
   **Describe the solution you'd like**
   Ideally, a compaction that orphans a file would trigger the old file to be deleted right away, rather than leaving it behind for the accumulo-gc service to delete later. However, this doesn't work today, because tablet splits and table clones can cause a file to be referenced by more than one tablet. If hard links were created for tablet splits and table clones, each tablet would have its own unique reference without using more storage space, and could safely delete that reference on its own. The underlying filesystem would be responsible for freeing the underlying storage when the last hard link to a given file was removed. We would no longer need an accumulo-gc service at all.
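   
   A minimal sketch of the intended semantics, using local-filesystem hard links via `java.nio.file` (the class and method names below are hypothetical; a real implementation would go through Accumulo's volume/Hadoop filesystem layer, and the underlying filesystem would need to support hard links):
   
   ```java
   import java.io.IOException;
   import java.nio.file.Files;
   import java.nio.file.Path;
   
   public class HardLinkRefSketch {
   
     // On a split or clone, create a hard link so the child tablet gets its own
     // name for the same underlying file data; no extra storage is consumed.
     static Path linkForChild(Path parentFile, Path childDir) throws IOException {
       return Files.createLink(childDir.resolve(parentFile.getFileName()), parentFile);
     }
   
     // When a compaction orphans the tablet's copy, the tablet deletes only its
     // own link. The filesystem frees the blocks once the last link is removed.
     static void deleteOwnReference(Path tabletLink) throws IOException {
       Files.deleteIfExists(tabletLink);
     }
   }
   ```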
   
   **Describe alternatives you've considered**
   Keep the accumulo-gc as it is, or track file reference counts using conditional mutations, so that a file is deleted only when its reference count reaches 0.
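   
   If the reference-counting alternative were pursued, a rough sketch using Accumulo's `ConditionalWriter` might look like the following (the `refcount:count` column layout and the method name are hypothetical, not an existing Accumulo schema):
   
   ```java
   import org.apache.accumulo.core.client.AccumuloClient;
   import org.apache.accumulo.core.client.ConditionalWriter;
   import org.apache.accumulo.core.client.ConditionalWriter.Status;
   import org.apache.accumulo.core.client.ConditionalWriterConfig;
   import org.apache.accumulo.core.data.Condition;
   import org.apache.accumulo.core.data.ConditionalMutation;
   
   public class FileRefCountSketch {
   
     /**
      * Conditionally decrement the reference count stored for a file. The mutation
      * is accepted only if the count still holds the value last read, so two
      * processes cannot both decrement from the same starting value.
      *
      * Returns true only if the decrement was accepted and brought the count to
      * zero, in which case the caller may delete the file. If the condition
      * failed, the caller should re-read the count and retry.
      */
     static boolean tryDecrementToZero(AccumuloClient client, String metaTable,
         String filePath, long lastReadCount) throws Exception {
       try (ConditionalWriter writer =
           client.createConditionalWriter(metaTable, new ConditionalWriterConfig())) {
   
         long next = lastReadCount - 1;
         ConditionalMutation cm = new ConditionalMutation(filePath);
         cm.addCondition(new Condition("refcount", "count")
             .setValue(Long.toString(lastReadCount)));
         cm.put("refcount", "count", Long.toString(next));
   
         Status status = writer.write(cm).getStatus();
         return status == Status.ACCEPTED && next == 0;
       }
     }
   }
   ```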
   
   **Additional context**
   This proposed solution could trigger additional compaction IO after a merge, IO that will be unnecessary once the no-chop merge feature is complete. Without no-chop merges, the chop compactions rewrite the files as smaller files, and those smaller files are then subject to being recombined in a subsequent compaction of the merged tablet. With the no-chop merge feature, however, the chop compactions can be skipped and the two smaller ranges can be recombined by updating only the metadata, with no IO. If the files appear under different names, so that each tablet has a unique file reference it can delete whenever it wants, then the ranges will not be easily recombined, and a merge could leave the two files subject to a subsequent compaction rather than simply combining their ranges as part of the merge.
   
   Using hard links could also make it more difficult to track storage use per table, and could cause data to be copied more than necessary for table exports/imports.

