Maarten Breddels created ARROW-9471:
---------------------------------------
Summary: [C++] Scan Dataset in reverse
Key: ARROW-9471
URL: https://issues.apache.org/jira/browse/ARROW-9471
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Maarten Breddels
If a dataset does not fit into the OS cache, it can be beneficial to alternate
between normal and reverse 'scanning'. Even if 90% of a set of files fits
into cache, scanning the same set twice in the same order will not make use of
the OS cache: assuming LRU-like eviction, the start of the data has already
been evicted by the time the second scan begins. On the other hand, if the
second scan goes in reverse order, 90% will still be in the OS cache. We use
this trick in vaex, and I'd like to support it for parquet reading as well.
(Is there a proper name/term for this?)
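
To illustrate the effect, here is a minimal simulation sketch. It assumes a
strict LRU cache and equally sized blocks, which is a simplification of what a
real OS page cache does; the function name simulate_passes and its parameters
are made up for this example.

```python
from collections import OrderedDict


def simulate_passes(n_blocks, cache_blocks, directions):
    """Count cache hits per pass over n_blocks, using a strict LRU cache.

    directions is a sequence of "forward" / "reverse", one entry per pass.
    This is a toy model: a real OS page cache is not a strict LRU.
    """
    cache = OrderedDict()  # keys are block ids, ordered oldest to newest
    hits_per_pass = []
    for direction in directions:
        if direction == "forward":
            order = range(n_blocks)
        else:
            order = range(n_blocks - 1, -1, -1)
        hits = 0
        for block in order:
            if block in cache:
                hits += 1
                cache.move_to_end(block)  # mark as most recently used
            else:
                cache[block] = True
                if len(cache) > cache_blocks:
                    cache.popitem(last=False)  # evict least recently used
        hits_per_pass.append(hits)
    return hits_per_pass


# 100 blocks, cache holds 90 of them.
same_order = simulate_passes(100, 90, ["forward", "forward"])
alternating = simulate_passes(100, 90, ["forward", "reverse"])
print(same_order, alternating)
```

Under these assumptions the second forward pass gets no hits at all, while the
reverse pass hits on the 90 blocks that are still cached.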
Note that since you don't want to reverse at the byte level, you may want to
reverse the order in which fragments are traversed, or fragments and row
groups. Chunks that are too small (e.g. pages) could lead to a performance
decrease because most read algorithms implement read-ahead optimizations for
forward reads (not for reverse reads). I think doing this at the fragment
level might be enough.
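
As a rough sketch of what fragment-level reversal could look like: the helper
below reverses only the fragment order on odd passes and keeps row groups
within each fragment in forward order, so read-ahead inside a file is not
disturbed. The function scan_order and the list-of-lists representation of
fragments are hypothetical, not part of the Arrow API.

```python
def scan_order(fragments, pass_index):
    """Return the (fragment_index, row_group) read order for one pass.

    fragments is a list where each item is the list of row-group ids of
    one fragment. Even passes go forward over fragments, odd passes go
    in reverse. Row groups inside a fragment always stay in forward order.
    """
    frag_indices = list(range(len(fragments)))
    if pass_index % 2 == 1:
        frag_indices.reverse()
    return [(f, rg) for f in frag_indices for rg in fragments[f]]


# Two fragments: the first has 2 row groups, the second has 3.
fragments = [[0, 1], [0, 1, 2]]
print(scan_order(fragments, 0))
print(scan_order(fragments, 1))
```

The same alternation could presumably be applied to a list of real dataset
fragments, but how that would be exposed by the scanner is an open question
for this issue.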