ctubbsii opened a new issue, #2729: URL: https://github.com/apache/accumulo/issues/2729
**Is your feature request related to a problem? Please describe.**

Compactions, splits, merges, and table clones are tricky and complicated because we have multiple references to the same files, making it difficult to know when it is safe to delete a file. We keep track of files in use, and when we're done with them, we mark them only as candidates for deletion. We rely on a separate garbage collection service to ensure that a file is no longer in use before we can safely delete it. Even then, the garbage collection process can be slow and risky, and if it crashes, it may leave behind unreferenced files.

**Describe the solution you'd like**

HDFS has a kind of [HardLink](https://hadoop.apache.org/docs/r2.8.2/hadoop-project-dist/hadoop-common/api/org/apache/hadoop/fs/HardLink.html) feature that we may be able to leverage to avoid garbage collection entirely. I have not tested how it works in practice, but in theory, we could create uniquely named hard links whenever we split a tablet, clone a table, or even bulk import files, rather than simply copying the same file reference. This would probably increase the memory footprint of the Hadoop NameNode, but it would enable dramatic simplification of Accumulo, so it would probably be worth it. When we are done with a file, we could just delete it immediately, because we wouldn't have to worry about any other references. The actual blocks would still be referenced, and not deleted, by the other hard links. We can let Hadoop reclaim the blocks when the last hard link is deleted.

**Describe alternatives you've considered**

Keep doing file-based garbage collection and hoping for the best.

**Additional context**

Doing this could simplify the implementation of "no-chop merges" described in #1327, because each file would reference only a single range in its metadata. To implement this, we may need some kind of global locking per file, to ensure a file can't be deleted while hard links are being created.
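The semantics the proposal relies on can be sketched against a local POSIX filesystem, as a stand-in for HDFS (whose HardLink behavior is untested here, as noted above). The file names below are invented for illustration; the point is that deleting one name leaves the data reachable through the other link, and the underlying storage is reclaimed only when the last link goes away.

```python
import os
import tempfile

# Local-filesystem analogy for the proposed HDFS hard-link scheme.
# File names are hypothetical stand-ins for Accumulo RFiles.
tmpdir = tempfile.mkdtemp()
original = os.path.join(tmpdir, "A0001.rf")  # e.g. a file produced by a compaction
clone = os.path.join(tmpdir, "A0002.rf")     # uniquely named link made for a clone/split

with open(original, "wb") as f:
    f.write(b"rfile-bytes")

os.link(original, clone)                     # second name for the same inode/blocks
assert os.stat(original).st_nlink == 2

# Deleting the original name is immediately safe: the clone's link
# still references the data, so no reference counting is needed.
os.remove(original)
assert os.stat(clone).st_nlink == 1
with open(clone, "rb") as f:
    assert f.read() == b"rfile-bytes"        # blocks live until the last link is removed
```

This is only a sketch of the desired behavior; whether HDFS's HardLink utility actually provides these semantics for files stored in HDFS (rather than on the local filesystem) is exactly what would need testing.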
We'd need to test to make sure that the original file can still be deleted, that it's treated like any other hard link, and that we can make hard links of hard links, etc. We might still want a garbage collection service to lazily clean up files, but we'd no longer have to do complicated reference checking for candidates if we could rely on file names being globally unique.
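The two properties to verify can also be sketched locally: chaining hard links (a link of a link) and combining that with globally unique names so that any owner can delete its own name immediately, with no candidate tracking. This is a hypothetical illustration under local POSIX semantics, not a test of HDFS itself; the naming scheme here is invented.

```python
import os
import tempfile
import uuid

tmpdir = tempfile.mkdtemp()

def unique_name() -> str:
    # Globally unique file name, so no two tablets ever share a name.
    return os.path.join(tmpdir, uuid.uuid4().hex + ".rf")

first = unique_name()
with open(first, "wb") as f:
    f.write(b"data")

second = unique_name()
os.link(first, second)    # hard link of the original
third = unique_name()
os.link(second, third)    # hard link of a hard link: same inode again
assert os.stat(first).st_nlink == 3

# Each owner deletes its own name immediately; no reference checking.
os.remove(first)
os.remove(second)
with open(third, "rb") as f:
    assert f.read() == b"data"   # data survives until the last link is removed
```

On a POSIX filesystem all links to an inode are equivalent, so the "original" has no special status; confirming the same holds for HDFS hard links is part of the testing this issue calls for.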
