Maarten Breddels created ARROW-9471:
---------------------------------------

             Summary: [C++] Scan Dataset in reverse
                 Key: ARROW-9471
                 URL: https://issues.apache.org/jira/browse/ARROW-9471
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Maarten Breddels


If a dataset does not fit into the OS cache, it can be beneficial to alternate 
between forward and reverse 'scanning'. Even if 90% of a set of files fits 
into the cache, scanning the same set twice in the same order will not make use 
of the OS cache: with LRU-like eviction, the data read first has already been 
evicted by the time the second scan asks for it. If, on the other hand, the 
second scan goes in reverse order, the 90% read most recently will still be in 
the OS cache. We use this trick in vaex, and I'd like to support it for 
parquet reading as well. (Is there a proper name/term for this?)
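
To make the cache argument concrete, here is a small simulation. It is only a 
sketch: it assumes a strict LRU cache with fixed-size blocks, which real OS 
page caches only approximate, and the function name and parameters are made up 
for this illustration.

```python
from collections import OrderedDict


def second_pass_hits(n_blocks, cache_blocks, reverse_second_pass):
    """Count cache hits during the second of two full scans.

    Assumption: the cache is a strict LRU over equally sized blocks.
    """
    cache = OrderedDict()  # keys ordered from least to most recently used

    def read(block):
        hit = block in cache
        if hit:
            cache.move_to_end(block)  # mark as most recently used
        else:
            cache[block] = True
            if len(cache) > cache_blocks:
                cache.popitem(last=False)  # evict the least recently used
        return hit

    # First pass: always forward, starting from a cold cache.
    for block in range(n_blocks):
        read(block)

    # Second pass: forward again, or in reverse order.
    second = range(n_blocks)
    if reverse_second_pass:
        second = reversed(second)
    return sum(read(block) for block in second)


# 10 blocks of data, room for 9 of them in the cache (90%).
forward_hits = second_pass_hits(10, 9, reverse_second_pass=False)
reverse_hits = second_pass_hits(10, 9, reverse_second_pass=True)
print(forward_hits, reverse_hits)
```

Under these assumptions the forward second pass gets no hits at all, because 
each read evicts exactly the block that will be needed next, while the reverse 
second pass hits on every block that was still cached.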

Note that since you don't want to reverse at the byte level, you may want to 
reverse the order in which fragments, or fragments and row groups, are 
traversed. Chunks that are too small (e.g. pages) could decrease performance, 
because most read paths implement read-ahead optimization for forward reads 
(not for reverse reads). I think doing this at the fragment level might be 
enough.
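
A rough sketch of what fragment-level reversal could look like, in plain 
Python rather than the Arrow C++ API. The function name, its parameters and 
the example file names are hypothetical. Fragments are visited in reverse on 
every second pass, while the row groups inside each fragment are still read 
front to back so that read-ahead keeps working.

```python
from typing import Iterator, Sequence, Tuple


def alternating_scan_order(
    fragments: Sequence[str],
    row_groups_per_fragment: Sequence[int],
    n_passes: int,
) -> Iterator[Tuple[int, str, int]]:
    """Yield (pass_index, fragment, row_group) in the order they would be read.

    Hypothetical helper: even passes visit fragments first to last, odd
    passes visit them last to first. Row groups always go forward.
    """
    for pass_index in range(n_passes):
        order = range(len(fragments))
        if pass_index % 2 == 1:
            order = reversed(order)  # reverse only at the fragment level
        for i in order:
            for row_group in range(row_groups_per_fragment[i]):
                yield pass_index, fragments[i], row_group


# Two passes over three made-up files with 2, 1 and 3 row groups.
plan = list(
    alternating_scan_order(
        ["a.parquet", "b.parquet", "c.parquet"], [2, 1, 3], n_passes=2
    )
)
for step in plan:
    print(step)
```

The second pass starts with the fragment that was read last in the first 
pass, which is the one most likely to still be in the OS cache.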



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
