Re: [I] Fast parquet order inversion [datafusion]

via GitHub Thu, 21 Aug 2025 11:23:36 -0700


suremarc commented on issue #17172:
URL: https://github.com/apache/datafusion/issues/17172#issuecomment-3211643923


   > [@crepererum](https://github.com/crepererum) Thank you for bringing up the 
idea.
   > 
   > My colleague [@suremarc](https://github.com/suremarc) has written a 
related issue in the arrow-rs repo: 
[apache/arrow-rs#3922](https://github.com/apache/arrow-rs/issues/3922). And 
currently, we have an implementation for this.
   > 
   > Looking forward to collaborating.
   
   Yes, this was a pretty gnarly issue, and we ended up writing a reverse 
Parquet reader that reads entire row groups into memory one at a time and 
reverses each one (using the Arrow `take` kernel as mentioned in this thread). 
We also have a `ReverseOrder` optimizer that runs before `EnforceSorting` and 
looks for opportunities to reverse a Parquet scan if doing so would eliminate 
sorts. (On that note, it would be nice if DataFusion execution plans supported 
sort pushdown; then we wouldn't have to implement a custom optimizer.)
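
   For anyone skimming, here is a rough, std-only sketch of the two ideas 
above. It is not our actual code: row groups are modeled as plain `Vec<i64>` 
instead of Arrow record batches, `take` is a toy stand-in for the Arrow kernel, 
and `SortKey`, `reverse_scan`, `reverse_ordering` and 
`reversal_eliminates_sort` are hypothetical names made up for illustration. The 
ordering check assumes that reversing a sort key flips both the direction and 
the null placement.

```rust
/// Toy stand-in for the Arrow `take` kernel: gather `values[i]` per index.
fn take(values: &[i64], indices: &[usize]) -> Vec<i64> {
    indices.iter().map(|&i| values[i]).collect()
}

/// Visit row groups last to first and reverse each one with `take`
/// using descending indices. Only one reversed group is produced at a
/// time, but each group is still fully materialized.
fn reverse_scan(row_groups: &[Vec<i64>]) -> impl Iterator<Item = Vec<i64>> + '_ {
    row_groups.iter().rev().map(|rg| {
        let indices: Vec<usize> = (0..rg.len()).rev().collect();
        take(rg, &indices)
    })
}

/// Hypothetical, simplified description of one sort key.
#[derive(Clone, Copy, Debug, PartialEq)]
struct SortKey {
    column: &'static str,
    descending: bool,
    nulls_first: bool,
}

/// Assumption: reading in reverse flips direction and null placement.
fn reverse_ordering(ordering: &[SortKey]) -> Vec<SortKey> {
    ordering
        .iter()
        .map(|k| SortKey {
            descending: !k.descending,
            nulls_first: !k.nulls_first,
            ..*k
        })
        .collect()
}

/// A reversed scan removes the sort if the required ordering is a
/// non-empty prefix of the file's reversed ordering.
fn reversal_eliminates_sort(file_ordering: &[SortKey], required: &[SortKey]) -> bool {
    let reversed = reverse_ordering(file_ordering);
    !required.is_empty() && reversed.starts_with(required)
}
```

   With row groups `[1, 2, 3]`, `[4, 5]` and `[6, 7, 8, 9]`, flattening 
`reverse_scan` yields the values from 9 down to 1. A file sorted by `ts` 
ascending with nulls last would satisfy a required `ts` descending, nulls first 
ordering under `reversal_eliminates_sort`.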
   
   Reversing entire row groups feels like a bad solution in general because the 
row groups can be extremely large depending on how the parquet file is written. 
Decoding page by page would be a great improvement, but 
[apache/arrow-rs#3922](https://github.com/apache/arrow-rs/issues/3922) calls 
out some practical difficulties with implementing this. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

