suremarc opened a new issue, #3922:
URL: https://github.com/apache/arrow-rs/issues/3922
**Is your feature request related to a problem or challenge? Please describe
what you are trying to do.**
<!--
A clear and concise description of what the problem is. Ex. I'm always
frustrated when [...]
(This section helps Arrow developers understand the context and *why* for
this feature, in addition to the *what*)
-->
I have been evaluating Parquet and Arrow DataFusion for use at my company.
While testing DataFusion I noticed that queries like `SELECT * FROM table
ORDER BY field DESC LIMIT n` cause it to read the whole file, even though the
existing data was sorted in ascending order.
Upon further investigation, this made sense: the Parquet reader can't return
data in any order other than the one it was stored in. But this makes it hard
to minimize the amount of work done while executing certain queries, e.g.
fetching the last N events sorted by time before a certain known timestamp
(especially for small N).
**Describe the solution you'd like**
<!--
A clear and concise description of what you want to happen.
-->
A new function added to `ArrowReaderBuilder`, something like this:
```rust
pub fn with_reverse(self, reverse: bool) -> Self {
    Self { reverse, ..self }
}
```
which would cause the data to be streamed in the reverse of its native order,
with individual record batches also reversed, while still respecting limits.
E.g. `with_limit(100).with_reverse(true)` would return the last 100 rows
satisfying the query.
Setting `with_reverse` should probably not affect the order of the row
groups, since there are no guarantees on the organization of Parquet row groups
anyway.
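To make the intended semantics concrete, here is a minimal sketch under stated assumptions: plain `Vec`s stand in for record batches within one row group, and `reversed_rows_with_limit` is a hypothetical helper, not part of the parquet crate.

```rust
// Sketch (not the actual parquet crate API) of the proposed semantics:
// within a row group, stream batches in reverse of their stored order,
// reverse the rows inside each batch, and stop once the limit is reached.
fn reversed_rows_with_limit(batches: &[Vec<i64>], limit: usize) -> Vec<i64> {
    let mut out = Vec::with_capacity(limit);
    // Walk batches back-to-front so the last stored row comes out first.
    for batch in batches.iter().rev() {
        for &row in batch.iter().rev() {
            if out.len() == limit {
                return out;
            }
            out.push(row);
        }
    }
    out
}
```

With batches `[1, 2, 3]` and `[4, 5]` and a limit of 4, this yields the last four stored rows in reverse: `[5, 4, 3, 2]`.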
**Describe alternatives you've considered**
<!--
A clear and concise description of any alternative solutions or features
you've considered.
-->
After realizing that implementing this feature would be non-trivial, I tried
implementing my own query code: fetching a whole row group at a time using the
existing reader builder, then reversing the entire row group. See
[here](https://github.com/suremarc/polygon-arrow-rs/blob/master/src/main.rs).
It works, but it has to deserialize the entire row group even when the limit is
as small as 1. A more sophisticated implementation would deserialize only the
minimum number of pages before stopping early, as the existing code in the
Parquet library does.
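For illustration, the cost of this workaround can be sketched as follows; plain integers stand in for rows, and `reverse_row_group` is a hypothetical helper written for this sketch, not code from the linked repository.

```rust
// Sketch of the workaround's cost: the whole row group is deserialized
// before the reverse and limit are applied, even when the limit is 1.
fn reverse_row_group(row_group: &[i64], limit: usize) -> (Vec<i64>, usize) {
    // Simulates deserializing every row in the group up front.
    let deserialized: Vec<i64> = row_group.to_vec();
    let rows_deserialized = deserialized.len(); // all rows, regardless of limit
    // Only after full materialization can we reverse and truncate.
    let mut reversed: Vec<i64> = deserialized.into_iter().rev().collect();
    reversed.truncate(limit);
    (reversed, rows_deserialized)
}
```

Even with `limit = 1`, `rows_deserialized` equals the full row-group size, which is exactly the inefficiency a page-aware implementation would avoid.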
If the library had a lower-level API, it might be possible to support specific
use cases like reverse ordering without overloading the existing logic (which
is already quite complex by the looks of it). However, I am not sure what such
an API would look like.
**Additional context**
<!--
Add any other context or screenshots about the feature request here.
-->
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]