[GitHub] [arrow-rs] suremarc opened a new issue, #3922: Support reverse order for Parquet streams

via GitHub Thu, 23 Mar 2023 12:27:14 -0700


suremarc opened a new issue, #3922:
URL: https://github.com/apache/arrow-rs/issues/3922


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   I have been evaluating Parquet and Arrow DataFusion for use at my company. 
While testing DataFusion, I noticed that queries like `SELECT * FROM table 
ORDER BY field DESC LIMIT n` cause it to read the whole file, even though the 
existing data is already sorted in ascending order. 
   
   Upon further investigation, this made sense: the Parquet reader can only 
return data in the order in which it was stored. But this makes it hard to 
minimize the amount of work done while executing certain queries, e.g. getting 
the last N events sorted by time before a certain known timestamp (especially 
for small N).
   
   **Describe the solution you'd like**
   
   A new function added to `ArrowReaderBuilder`, something like this:
   ```rust
   pub fn with_reverse(self, reverse: bool) -> Self {
       Self { reverse, ..self }
   }
   ```
   
   which would cause the data to be streamed in the reverse of its native 
order, with the rows inside each record batch reversed as well, while still 
respecting limits. E.g. `with_limit(100).with_reverse(true)` would return the 
last 100 rows satisfying the query. 
   
   Setting `with_reverse` should probably not affect the order of the row 
groups, since there are no guarantees on the organization of Parquet row groups 
anyway. 
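
   To make the intended semantics concrete, here is a minimal sketch that 
models the batches of a single row group as plain vectors. It does not use the 
Parquet crate at all: `read_reversed` and the `Vec<i64>` stand-in for a record 
batch are hypothetical names for illustration only, not an existing or 
proposed API.

```rust
/// Hypothetical model: each inner `Vec<i64>` stands in for one record batch
/// of a single row group, in its native (stored) order.
///
/// Returns the batches in reverse order, with the rows of each batch
/// reversed, emitting at most `limit` rows in total.
fn read_reversed(batches: &[Vec<i64>], limit: usize) -> Vec<Vec<i64>> {
    let mut remaining = limit;
    let mut out = Vec::new();

    // Walk the batches from last to first.
    for batch in batches.iter().rev() {
        if remaining == 0 {
            break;
        }
        // Reverse the rows inside the batch, keeping only what the limit allows.
        let reversed: Vec<i64> = batch.iter().rev().take(remaining).copied().collect();
        if reversed.is_empty() {
            continue; // skip empty batches rather than emitting them
        }
        remaining -= reversed.len();
        out.push(reversed);
    }
    out
}
```

   With batches `[1, 2, 3]`, `[4, 5]`, `[6, 7, 8, 9]` and a limit of 5, this 
yields `[9, 8, 7, 6]` followed by `[5]`, i.e. the last five rows in descending 
order.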
   
   **Describe alternatives you've considered**
   
   After realizing that implementing this feature would be non-trivial, I tried 
implementing my own querying code by fetching a whole row group at a time, 
using the existing query builder, then reversing the entire row group. See 
[here](https://github.com/suremarc/polygon-arrow-rs/blob/master/src/main.rs). 
It works, but it has to deserialize the entire row group, even though the limit 
might be 1. A more sophisticated implementation would deserialize only the 
minimum number of pages before stopping early, as the existing code in the 
Parquet library does. 
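
   The difference between the two approaches can be sketched with a toy model 
in which a row group is a list of pages. The `Page` type and both functions 
are hypothetical, and the model ignores real-world complications such as page 
boundaries that differ between columns. It only counts how many pages each 
strategy would have to decode.

```rust
/// Hypothetical stand-in for one data page of a single column.
struct Page {
    rows: Vec<i64>,
}

/// Workaround described above: decode every page in the row group,
/// reverse the result, then apply the limit.
/// Returns the rows and the number of pages decoded.
fn last_n_whole_group(pages: &[Page], n: usize) -> (Vec<i64>, usize) {
    let mut all: Vec<i64> = pages.iter().flat_map(|p| p.rows.iter().copied()).collect();
    all.reverse();
    all.truncate(n);
    (all, pages.len())
}

/// Early-stopping variant: decode pages from the end and stop as soon as
/// `n` rows have been collected.
/// Returns the rows and the number of pages decoded.
fn last_n_early_stop(pages: &[Page], n: usize) -> (Vec<i64>, usize) {
    let mut out = Vec::new();
    let mut decoded = 0;
    for page in pages.iter().rev() {
        if out.len() >= n {
            break;
        }
        decoded += 1; // this page has to be decoded
        let need = n - out.len();
        out.extend(page.rows.iter().rev().take(need).copied());
    }
    (out, decoded)
}
```

   For three pages and a limit of 1, both functions return the same single 
row, but the first decodes all three pages while the second decodes only the 
last one.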
   
   If the library had a lower-level API, it might be possible to support 
specific use cases like reverse ordering without overloading the existing logic 
(which already looks quite complex). However, I am not sure what such an API 
would look like. 
   
   **Additional context**


