zhuqi-lucas commented on issue #7363: URL: https://github.com/apache/arrow-rs/issues/7363#issuecomment-2851331211
@alamb @XiangpengHao I updated the polish PR with a new commit:
https://github.com/apache/arrow-rs/pull/7428/commits/d26de886685a8fc658b84d7f4e73b87243df5037

I found that part of the regression comes from page cache misses: even with the page cache enabled, we spend extra time decoding pages again. For example, our default batch size for ClickBench is 8192, and in ClickBench Q27 we see more than 20% page cache misses because some batches are larger than one page. With the commit above, its performance no longer regresses.

Explanation details:

```text
Most cases:
  Assumption & observation: each page consists of multiple batches.
  Then our pipeline looks like this:

  Load Page 1
    Load batch 1 -> evaluate predicates -> filter 1 -> load & emit batch 1
    Load batch 2 -> evaluate predicates -> filter 2 -> load & emit batch 2
    Load batch 3 -> evaluate predicates -> filter 3 -> load & emit batch 3
  Load Page 2
    Load batch 4 -> evaluate predicates -> filter 4 -> load & emit batch 4
    Load batch 5 -> evaluate predicates -> filter 5 -> load & emit batch 5

But in some cases:

  Load Page 1
    Load batch 1 -> evaluate predicates -> filter 1 -> load & emit batch 1
  Load Page 2
    Load batch 1 -> evaluate predicates -> filter 1 -> load & emit batch 1
  Load Page 3
    Load batch 1 -> evaluate predicates -> filter 1 -> load & emit batch 1

When we load Page 2, the cache is updated to hold Page 2, so the next time we access Page 1 it misses.
```
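To make the miss pattern above concrete, here is a minimal sketch. It assumes a cache that holds only the most recently decoded page, and that every batch is read twice: once to evaluate predicates and once to load and emit the filtered rows. `SinglePageCache`, `pages_for`, and `simulate` are hypothetical names for illustration, not arrow-rs APIs, and the row counts are made up.

```rust
/// Hypothetical single-entry cache: remembers only the last decoded page.
struct SinglePageCache {
    cached: Option<usize>,
    hits: usize,
    misses: usize,
}

impl SinglePageCache {
    fn new() -> Self {
        Self { cached: None, hits: 0, misses: 0 }
    }

    /// Record one access to `page`, replacing the cached page on a miss.
    fn access(&mut self, page: usize) {
        if self.cached == Some(page) {
            self.hits += 1;
        } else {
            self.misses += 1; // the page has to be decoded again
            self.cached = Some(page);
        }
    }
}

/// Page indices touched by rows [start, end) when each page holds `page_rows` rows.
fn pages_for(start: usize, end: usize, page_rows: usize) -> std::ops::RangeInclusive<usize> {
    (start / page_rows)..=((end - 1) / page_rows)
}

/// Read `total_rows` in batches of `batch_rows`; each batch is read twice
/// (predicate evaluation, then load & emit). Returns (hits, misses).
fn simulate(total_rows: usize, page_rows: usize, batch_rows: usize) -> (usize, usize) {
    assert!(page_rows > 0 && batch_rows > 0);
    let mut cache = SinglePageCache::new();
    let mut start = 0;
    while start < total_rows {
        let end = (start + batch_rows).min(total_rows);
        for _pass in 0..2 {
            for page in pages_for(start, end, page_rows) {
                cache.access(page);
            }
        }
        start = end;
    }
    (cache.hits, cache.misses)
}

fn main() {
    // Each page holds several batches: only the first touch of a page misses.
    let (hits, misses) = simulate(24, 12, 4);
    println!("page > batch: hits={hits} misses={misses}");

    // Each batch spans several pages: the second pass restarts at a page
    // that has already been replaced in the cache, so every access misses.
    let (hits, misses) = simulate(24, 4, 12);
    println!("batch > page: hits={hits} misses={misses}");
}
```

In the first configuration the number of misses equals the number of pages, which is the minimum possible. In the second, the predicate pass leaves the last page of the batch in the cache, so the emit pass misses on every page.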
