[
https://issues.apache.org/jira/browse/ARROW-9471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177252#comment-17177252
]
Joris Van den Bossche commented on ARROW-9471:
----------------------------------------------
Another use case for a "reverse scan" is an efficient {{tail}} method (to
inspect the last rows of a dataset), which [~npr] added in R in
https://github.com/apache/arrow/pull/7913. Doesn't that implementation
currently have to traverse all scan tasks just to reach the last one?
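
To illustrate what a reverse scan would buy {{tail}}, here is a minimal
sketch in plain Python. It does not use the Arrow API: fragments are modeled
as plain lists of rows, and the function name and the "visited" counter are
hypothetical, there only to show how few fragments need to be touched when
traversal can start from the end.

```python
def tail(fragments, n):
    """Return the last n rows and the number of fragments visited.

    `fragments` is a sequence of row lists, a stand-in for dataset
    fragments (an assumption for illustration, not the Arrow API).
    """
    collected = []   # row chunks, gathered from the last fragment backwards
    visited = 0
    remaining = n
    for fragment in reversed(fragments):
        if remaining <= 0:
            break
        visited += 1
        rows = fragment[-remaining:]   # at most `remaining` trailing rows
        collected.append(rows)
        remaining -= len(rows)
    # Chunks were gathered back to front; restore the original row order.
    result = [row for chunk in reversed(collected) for row in chunk]
    return result, visited


fragments = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
print(tail(fragments, 5))   # ([5, 6, 7, 8, 9], 2)
```

A forward-only scan would have to visit all three fragments here; with
reverse traversal the first fragment is never read.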
> [C++] Scan Dataset in reverse
> -----------------------------
>
> Key: ARROW-9471
> URL: https://issues.apache.org/jira/browse/ARROW-9471
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Maarten Breddels
> Priority: Minor
>
> If a dataset does not fit into the OS cache, it can be beneficial to
> alternate between normal and reverse 'scanning'. Even if 90% of a set of
> files fits into the cache, scanning the same set twice in the same order
> will not make use of the OS cache, because the files read first have
> already been evicted by the time the second scan reaches them. On the other
> hand, if the second scan goes in reverse order, 90% will still be in the OS
> cache. We use this trick in vaex, and I'd like to support that for parquet
> reading as well. (Is there a proper name/term for this?)
> Note that you don't want to reverse at the byte level; instead, you may
> want to reverse the order in which fragments, or fragments and row groups,
> are traversed. Chunks that are too small (e.g. pages) could lead to a
> performance decrease, because most read algorithms implement a read-ahead
> optimization (not the reverse). I think doing this at the fragment level
> might be enough.
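
The cache argument in the description can be checked with a small
simulation. This is only a sketch under a strong assumption: the OS cache is
modeled as a strict LRU cache over whole fragments, which real page caches
only approximate. The function names are hypothetical.

```python
from collections import OrderedDict


def scan(cache, capacity, blocks):
    """Read `blocks` in order through an LRU cache; return the hit count."""
    hits = 0
    for block in blocks:
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark as most recently used
        else:
            cache[block] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used
    return hits


def second_pass_hits(n_blocks, capacity, reverse_second):
    """Scan all blocks once, then again; return the hits of the second scan."""
    cache = OrderedDict()
    first = list(range(n_blocks))
    scan(cache, capacity, first)
    second = first[::-1] if reverse_second else first
    return scan(cache, capacity, second)


# 10 fragments, cache holds 9 of them (the "90% fits" case).
print(second_pass_hits(10, 9, reverse_second=False))  # 0
print(second_pass_hits(10, 9, reverse_second=True))   # 9
```

Under this model the forward rescan misses on every fragment, because each
read evicts exactly the fragment that is needed next, while the reverse
rescan hits on everything except the one fragment that did not fit.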
--
This message was sent by Atlassian Jira
(v8.3.4#803005)