[GitHub] [arrow-rs] tustvold commented on issue #3922: Support reverse order for Parquet streams

via GitHub Thu, 23 Mar 2023 15:13:44 -0700


tustvold commented on issue #3922:
URL: https://github.com/apache/arrow-rs/issues/3922#issuecomment-1481983517


   > which would cause the data to be streamed in the reverse of its native 
order as well as individual record batches being reversed, respecting limits 
all the while
   
   Unfortunately, the way Parquet data is encoded likely makes doing this at 
anything below the row group level impractical, for a couple of reasons:
   
   * With the exception of PLAIN encoding, there is no easy way to decode pages 
in reverse order, as the underlying 
[encodings](https://github.com/apache/parquet-format/blob/master/Encodings.md) 
are length-prefixed blocks
   * The Dremel record shredding, especially for repetition levels, is 
order-sensitive
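
To make the first point concrete, here is a toy sketch of why length-prefixed data only decodes front to back. The layout (a 1-byte length followed by that many payload bytes) and the name `split_blocks` are assumptions made up for illustration. The real Parquet encodings use more involved headers, but the same property holds: a block boundary is only known once the preceding prefix has been read.

```rust
/// Split a buffer of length-prefixed blocks into their payload slices.
///
/// Hypothetical toy layout (NOT the actual Parquet format): each block is
/// one length byte followed by that many payload bytes.
///
/// Returns `None` if the final block is truncated.
fn split_blocks(buf: &[u8]) -> Option<Vec<&[u8]>> {
    let mut blocks = Vec::new();
    let mut pos = 0;
    while pos < buf.len() {
        // The only way to learn where this block ends is to read its prefix,
        // and the only way to find the prefix is to have walked every
        // earlier block first.
        let len = buf[pos] as usize;
        let start = pos + 1;
        let end = start.checked_add(len)?;
        if end > buf.len() {
            return None;
        }
        blocks.push(&buf[start..end]);
        pos = end;
    }
    Some(blocks)
}
```

Starting from the end of the buffer there is no way to tell whether a given byte is a length prefix or payload, so a reverse reader would still have to scan forward to locate the block boundaries.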
   
   That being said, it is possible to just decode the last n rows using 
[`RowFilter`](https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.RowFilter.html),
 potentially reducing this to a query optimisation problem in DataFusion, as 
opposed to something needing new functionality in the parquet crate.
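
As a rough sketch of the planning side, the following works out which row group to start from and how many rows to skip so that only the last `n` rows are decoded. It uses only the standard library. `TailPlan` and `plan_tail` are hypothetical names, and the per-row-group row counts are assumed to come from the file metadata. As far as I know, the skip/select form of row selection in the parquet crate is `RowSelection` (built from `RowSelector::skip` and `RowSelector::select`), whereas `RowFilter` is predicate-based, so treat the mapping of this plan onto the crate as an assumption.

```rust
/// Hypothetical plan for decoding only the last `n` rows of a file.
#[derive(Debug, PartialEq)]
struct TailPlan {
    /// Index of the first row group that needs to be read.
    first_row_group: usize,
    /// Rows to skip at the start of that row group.
    skip_in_first: usize,
    /// Total rows that will be decoded (at most `n`).
    rows_selected: usize,
}

/// Walk the row groups from the back until `n` rows are covered.
///
/// `row_group_rows` holds the row count of each row group, in file order.
/// Returns `None` when there is nothing to read.
fn plan_tail(row_group_rows: &[usize], n: usize) -> Option<TailPlan> {
    if n == 0 || row_group_rows.is_empty() {
        return None;
    }
    let mut remaining = n;
    for (idx, &rows) in row_group_rows.iter().enumerate().rev() {
        if rows >= remaining {
            // This row group contains the first of the last `n` rows.
            return Some(TailPlan {
                first_row_group: idx,
                skip_in_first: rows - remaining,
                rows_selected: n,
            });
        }
        remaining -= rows;
    }
    // The file holds fewer than `n` rows: read all of it.
    Some(TailPlan {
        first_row_group: 0,
        skip_in_first: 0,
        rows_selected: n - remaining,
    })
}
```

For example, with row groups of 100, 100 and 50 rows and `n = 120`, the plan starts at row group 1 and skips its first 30 rows. Each row group is still decoded in forward order, and reversing the resulting batches would be left to the query engine.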


