[ 
https://issues.apache.org/jira/browse/ARROW-9471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177252#comment-17177252
 ] 

Joris Van den Bossche commented on ARROW-9471:
----------------------------------------------

Another use case for a "reverse scan" is an efficient {{tail}} method (to 
inspect the last rows of a dataset), which [~npr] added in R in 
https://github.com/apache/arrow/pull/7913. Does that implementation 
currently have to traverse all scan tasks to reach the last one?
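
To illustrate why a reverse scan would help {{tail}}, here is a minimal sketch in plain Python. It is not the Arrow or R implementation: the fragment loaders, {{make_loader}} and the {{loads}} log are hypothetical stand-ins for scan tasks, used only to show that walking the fragments backwards touches just the last few of them.

```python
def tail(fragment_loaders, n):
    """Return the last n rows, loading fragments from the end backwards."""
    collected = []   # row slices, gathered in reverse fragment order
    remaining = n
    for load in reversed(fragment_loaders):
        if remaining <= 0:
            break
        rows = load()                 # only now is the fragment actually read
        taken = rows[-remaining:]     # at most `remaining` rows from its end
        collected.append(taken)
        remaining -= len(taken)
    # Restore the original row order.
    result = []
    for part in reversed(collected):
        result.extend(part)
    return result


# Hypothetical lazy fragments: each loader records that it was read.
loads = []

def make_loader(index, rows):
    def load():
        loads.append(index)
        return rows
    return load

# Five fragments of four rows each (rows 0..19).
loaders = [make_loader(i, list(range(i * 4, i * 4 + 4))) for i in range(5)]
last_rows = tail(loaders, 6)
```

In this toy setup only the last two of the five fragments are read; a forward-only scan would have to visit all five before reaching the final rows.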

> [C++] Scan Dataset in reverse
> -----------------------------
>
>                 Key: ARROW-9471
>                 URL: https://issues.apache.org/jira/browse/ARROW-9471
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Maarten Breddels
>            Priority: Minor
>
> If a dataset does not fit into the OS cache, it can be beneficial to 
> alternate between normal and reverse 'scanning'. Even if 90% of a set of 
> files fits into the cache, scanning the same set twice in the same order 
> will not make use of the OS cache. On the other hand, if the second scan 
> goes in reverse order, 90% will still be in the OS cache. We use this trick 
> in vaex, and I'd like to support that for parquet reading as well. (Is there 
> a proper name/term for this?)
> Note that since you don't want to reverse on byte level, you may want to 
> reverse the order in which fragments, or fragments and row groups, are 
> traversed. Chunks that are too small (e.g. pages) could decrease 
> performance because most read algorithms implement read-ahead optimization 
> (but not in the reverse direction). I think doing this on fragment level 
> might be enough.
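
The cache argument in the description can be checked with a small simulation. This is only a sketch under a loud assumption: the OS cache is modelled as a strict LRU cache over whole fragments, which real page caches only approximate. The function name {{count_cache_hits}} and the numbers (10 fragments, room for 9) are made up for illustration.

```python
from collections import OrderedDict


def count_cache_hits(access_order, cache_size):
    """Count hits for a sequence of fragment reads against an LRU cache."""
    cache = OrderedDict()   # keys ordered from least to most recently used
    hits = 0
    for fragment in access_order:
        if fragment in cache:
            hits += 1
            cache.move_to_end(fragment)      # mark as most recently used
        else:
            cache[fragment] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)    # evict least recently used
    return hits


fragments = list(range(10))   # 10 equally sized fragments
cache_size = 9                # assumption: 90% of the data fits in cache

# Two forward scans versus a forward scan followed by a reverse scan.
hits_forward_twice = count_cache_hits(fragments + fragments, cache_size)
hits_alternating = count_cache_hits(fragments + fragments[::-1], cache_size)
```

Under this model the second forward scan gets no hits at all, because each read evicts exactly the fragment that is needed next, while the reverse scan hits 9 of its 10 reads.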



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
