dlmarion commented on PR #5302: URL: https://github.com/apache/accumulo/pull/5302#issuecomment-2634908149
Looking at the new [vectored read API](https://issues.apache.org/jira/browse/HADOOP-11867) in Hadoop has been on my todo list. Another good resource for understanding it is [these slides](https://www.apachecon.com/acna2022/slides/02_Thakur_Hadoop_Vectored_IO.pdf). I tried to use the API but could not find a good way to apply it, because we don't deal with HDFS blocks directly. We deal with RFile blocks, and we cache them at a much different layer than where the HDFS block is retrieved. So in this PR I attempted to build something similar: prefetching RFile blocks and preemptively caching them.

I think this might make sense for operations that perform sequential reads, like compactions, so I wired it up in the FileCompactor for major compactions. I targeted the main branch because, on main, major compactions only run in Compactors. In earlier releases this change would cause churn in the data block cache and might decrease scan performance by evicting other blocks.

There are still some changes to be made, like making the number of blocks to prefetch a property and moving the ThreadPoolExecutor out of the Reader to somewhere else, but I wanted to get early feedback on the concept before putting more work into it.
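
To make the read-ahead idea concrete, here is a minimal sketch of sequential block prefetching. It is not the code in this PR and it uses no Accumulo or Hadoop APIs: `PrefetchingBlockReader`, the `loader` function (standing in for "read one RFile block"), `prefetchDepth`, and the thread count are all hypothetical names chosen for illustration.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.IntFunction;

// Hypothetical sketch: none of these names come from Accumulo or from the PR.
class PrefetchingBlockReader implements AutoCloseable {
  private final IntFunction<byte[]> loader; // assumed stand-in for reading one block
  private final int blockCount;
  private final int prefetchDepth; // how many blocks to read ahead
  private final ExecutorService pool;
  // Blocks that are loaded or still loading, keyed by block index.
  private final Map<Integer, CompletableFuture<byte[]>> pending = new ConcurrentHashMap<>();

  PrefetchingBlockReader(IntFunction<byte[]> loader, int blockCount,
      int prefetchDepth, int threads) {
    this.loader = loader;
    this.blockCount = blockCount;
    this.prefetchDepth = prefetchDepth;
    this.pool = Executors.newFixedThreadPool(threads);
  }

  // Start loading a block unless a load for it already exists.
  private CompletableFuture<byte[]> schedule(int index) {
    return pending.computeIfAbsent(index,
        i -> CompletableFuture.supplyAsync(() -> loader.apply(i), pool));
  }

  // Return the requested block and kick off loads for the next few blocks.
  byte[] read(int index) {
    if (index < 0 || index >= blockCount) {
      throw new IndexOutOfBoundsException("block " + index);
    }
    CompletableFuture<byte[]> current = schedule(index);
    for (int next = index + 1; next <= index + prefetchDepth && next < blockCount; next++) {
      schedule(next);
    }
    byte[] data = current.join(); // waits only if the prefetch has not finished yet
    // A single-pass reader will not revisit this block, so drop it to bound memory.
    pending.remove(index);
    return data;
  }

  @Override
  public void close() {
    pool.shutdownNow();
  }
}
```

A caller that walks blocks `0..n-1` in order would find most blocks already loaded (or loading) by the time it asks for them, and each block is loaded once. Dropping a block as soon as it is consumed only suits a one-pass reader such as a compaction; a shared block cache would need a real eviction policy, which is the cache-churn concern raised above.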
