zhuqi-lucas commented on PR #18817:
URL: https://github.com/apache/datafusion/pull/18817#issuecomment-3569492649

   > > Thank you @2010YOUY01 for the review.
   > > The memory overhead you raise is a valid concern; it is the key risk I 
mentioned for this approach.
   > > However, I want to clarify that row group reversal alone cannot 
eliminate the SortExec - it only provides TopK filtering benefits. Without 
reversing rows within each row group, the data remains in the original order 
(e.g., ASC when we need DESC), so the sort must stay. I propose we keep the 
complete optimization but default enable_reverse_scan to false. Once we 
implement page-level caching in arrow-rs (which will reduce memory overhead 
significantly), we can consider enabling it by default.
   > 
   > Did you mean 'cannot eliminate the SortExec(TopK)'? Just to confirm, there 
is no global sort, but it is true that we have to do a `TopK` on a whole row 
group for this naive approach.
   > 
   > My intuition is that for this kind of workload the bottleneck is the 
parquet decoding speed, and an extra `TopK` won't add much 
overhead, so this naive approach can also be pretty fast.
   > 
   > It makes a lot of sense that page/row-level reversal is very hard to 
implement on the `arrow-rs` side, so we have to figure out how to do this at 
the row-group level.
   > 
   > Summary: Perhaps we can start by adding a few end-to-end benchmarks that 
reflect your typical production workload. If this PR’s approach shows a clear 
improvement over the naive approach in [#18817 
(comment)](https://github.com/apache/datafusion/pull/18817#issuecomment-3568934764)
 (I'm happy to do a quick prototype), we should definitely move forward.
   
   
   
   Good point, @2010YOUY01. I agree that most of the time will be spent 
decoding pages. I can change this PR to add a config option that implements 
[#18817 
(comment)](https://github.com/apache/datafusion/pull/18817#issuecomment-3568934764),
 so we have more options to compare. I agree that the simpler solution is better.
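
To make the two options in this thread concrete, here is a minimal Python sketch. It is not DataFusion or arrow-rs code: the data, the function names, and the `max(g)` stand-in for row-group statistics are all hypothetical. It assumes a file sorted ASC and a query that wants DESC order. Reversing only the row-group order does not produce a sorted stream, while a TopK over row groups visited last-to-first can stop decoding early.

```python
import heapq

# Hypothetical file: values sorted ASC and split into three row groups.
row_groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]


def scan_reversed_groups(groups):
    """Read row groups last-to-first, leaving rows inside each group as stored."""
    return [v for g in reversed(groups) for v in g]


def scan_fully_reversed(groups):
    """Read row groups last-to-first and also reverse the rows inside each group."""
    return [v for g in reversed(groups) for v in reversed(g)]


def topk_desc_naive(groups, k):
    """TopK over row groups visited last-to-first.

    Returns the k largest values in DESC order and the number of row groups
    that were actually read.
    """
    heap = []  # min-heap holding the k largest values seen so far
    groups_read = 0
    for g in reversed(groups):
        # Stand-in for statistics-based pruning: skip a group whose max
        # cannot beat the current k-th largest value.
        if len(heap) == k and max(g) <= heap[0]:
            continue
        groups_read += 1
        for v in g:
            if len(heap) < k:
                heapq.heappush(heap, v)
            elif v > heap[0]:
                heapq.heapreplace(heap, v)
    return sorted(heap, reverse=True), groups_read


if __name__ == "__main__":
    print(scan_reversed_groups(row_groups))  # not DESC: a sort is still needed
    print(scan_fully_reversed(row_groups))   # DESC: no sort needed
    print(topk_desc_naive(row_groups, 2))    # top 2 values, row groups read
```

In this toy data the TopK only has to read the last row group; the other two are skipped by the threshold check, which is the benefit the row-group-level approach is expected to give.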


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

