Re: [PR] Prefetch blocks and place into data BlockCache for major compactions [accumulo]

via GitHub Tue, 04 Feb 2025 11:44:49 -0800


dlmarion commented on PR #5302:
URL: https://github.com/apache/accumulo/pull/5302#issuecomment-2634908149


   Looking at the new [vectored read API](https://issues.apache.org/jira/browse/HADOOP-11867)
in Hadoop has been on my todo list. Another good resource for understanding it is 
[here](https://www.apachecon.com/acna2022/slides/02_Thakur_Hadoop_Vectored_IO.pdf).
I attempted to use it, but could not find a good way to apply it because we 
don't deal with HDFS blocks directly. Instead, we deal with RFile blocks, and 
we cache them at a much different layer than where the HDFS block is 
retrieved.
   
   Instead I attempted to create something similar in this PR, prefetching 
RFile blocks and preemptively caching them. I think this might make sense for 
operations that perform sequential reads, like compactions. So I wired this up 
in the FileCompactor for major compactions, and I targeted the main branch 
because major compactions only run in Compactors. In earlier releases this 
change would cause churn in the data block cache and might cause a decrease in 
scan performance due to eviction of other blocks.
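
   A minimal sketch of the read-ahead idea, not the code in this PR: when block 
`i` is requested, loads for the next few blocks are queued on a shared executor 
and parked in a cache so that a sequential reader finds them already in flight. 
Every name here (`PrefetchingBlockReader`, `loader`, `prefetchDepth`, 
`isCached`) is hypothetical, the map only stands in for the data block cache, 
and there is no eviction.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.IntFunction;

// Hypothetical illustration only; none of these names are Accumulo classes.
final class PrefetchingBlockReader {
  private final IntFunction<byte[]> loader; // stands in for reading one RFile block
  private final int blockCount;
  private final int prefetchDepth;          // assumed to be configurable
  private final ExecutorService pool;       // owned by the caller, not the reader
  // Stand-in for the data block cache: block index -> pending or finished load.
  private final Map<Integer, CompletableFuture<byte[]>> cache = new ConcurrentHashMap<>();

  PrefetchingBlockReader(IntFunction<byte[]> loader, int blockCount, int prefetchDepth,
      ExecutorService pool) {
    this.loader = loader;
    this.blockCount = blockCount;
    this.prefetchDepth = prefetchDepth;
    this.pool = pool;
  }

  // Returns the requested block and queues loads for the blocks that follow it.
  byte[] read(int index) {
    if (index < 0 || index >= blockCount) {
      throw new IndexOutOfBoundsException("block " + index);
    }
    CompletableFuture<byte[]> current = schedule(index);
    int last = Math.min(blockCount - 1, index + prefetchDepth);
    for (int next = index + 1; next <= last; next++) {
      schedule(next);
    }
    return current.join();
  }

  // Starts at most one load per block; repeat calls reuse the same future.
  private CompletableFuture<byte[]> schedule(int index) {
    return cache.computeIfAbsent(index,
        i -> CompletableFuture.supplyAsync(() -> loader.apply(i), pool));
  }

  // True once a load for this block has been queued or completed.
  boolean isCached(int index) {
    return cache.containsKey(index);
  }

  public static void main(String[] args) {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    PrefetchingBlockReader reader =
        new PrefetchingBlockReader(i -> new byte[] {(byte) i}, 10, 3, pool);
    for (int i = 0; i < 10; i++) {
      reader.read(i); // a sequential scan, as a compaction would perform
    }
    pool.shutdown();
  }
}
```

   Passing the executor in, instead of creating it inside the reader, mirrors 
the follow-up work mentioned below; a real version would also need to bound the 
cache and handle failed loads.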
   
   There are still some changes to be made, like making the number of blocks to 
prefetch a property and moving the ThreadPoolExecutor out of the Reader to 
somewhere else, but I wanted to get early feedback on the concept before 
putting more work into it.


