Weston Pace created ARROW-14503:
-----------------------------------
Summary: [C++][Dataset] Projection pushdown in IPC (feather) format
Key: ARROW-14503
URL: https://issues.apache.org/jira/browse/ARROW-14503
Project: Apache Arrow
Issue Type: New Feature
Components: C++
Reporter: Weston Pace
The datasets API uses the RecordBatchFileReader to read feather files. This
reader always "reads" the entire file, even when only a few columns are
requested. If the file is memory-mapped this might not be a true read.
However, the datasets API never uses memory-mapped files.
This large read from RAM (or worse, disk) becomes a bottleneck for simple
queries that load only a few columns from the dataset.
The fix may be to modify the reader to seek to and pluck out only the needed
data. Or the fix may be to modify the datasets API to use memory-mapped files
when possible (although the former approach seems more generally applicable).
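
To make the "seek and pluck" idea concrete, here is a minimal sketch using only the Python standard library. It does not use the Arrow IPC format or any Arrow API. The file layout, write_toy_file and read_columns are all hypothetical, invented for illustration. The only point is that a footer recording each column's offset and size lets a reader fetch the requested columns without touching the rest of the file.

```python
import os
import struct
import tempfile

# Hypothetical toy layout (NOT the Arrow IPC format):
#   [column 0 payload][column 1 payload]...[footer][footer length: uint32]
# The footer holds a column count (uint32), then for each column:
#   name length (uint16), UTF-8 name, payload offset (uint64), payload size (uint64)


def write_toy_file(path, columns):
    """Write each column as packed little-endian int64 values, then the footer."""
    entries = []
    with open(path, "wb") as f:
        for name, values in columns.items():
            payload = struct.pack(f"<{len(values)}q", *values)
            entries.append((name, f.tell(), len(payload)))
            f.write(payload)
        footer = struct.pack("<I", len(entries))
        for name, offset, size in entries:
            encoded = name.encode("utf-8")
            footer += struct.pack("<H", len(encoded)) + encoded
            footer += struct.pack("<QQ", offset, size)
        f.write(footer)
        f.write(struct.pack("<I", len(footer)))


def read_columns(path, wanted):
    """Read only the wanted columns; return ({name: values}, payload bytes read)."""
    with open(path, "rb") as f:
        # The last 4 bytes give the footer length; the footer sits just before them.
        f.seek(-4, os.SEEK_END)
        (footer_len,) = struct.unpack("<I", f.read(4))
        f.seek(-(4 + footer_len), os.SEEK_END)
        footer = f.read(footer_len)

        # Parse the footer into {column name: (offset, size)}.
        (count,) = struct.unpack_from("<I", footer, 0)
        pos = 4
        index = {}
        for _ in range(count):
            (name_len,) = struct.unpack_from("<H", footer, pos)
            pos += 2
            name = footer[pos:pos + name_len].decode("utf-8")
            pos += name_len
            offset, size = struct.unpack_from("<QQ", footer, pos)
            pos += 16
            index[name] = (offset, size)

        # Seek directly to each requested column and read only its payload.
        result = {}
        bytes_read = 0
        for name in wanted:
            offset, size = index[name]
            f.seek(offset)
            payload = f.read(size)
            bytes_read += len(payload)
            result[name] = list(struct.unpack(f"<{size // 8}q", payload))
        return result, bytes_read


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "toy.bin")
    write_toy_file(path, {"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
    # Only column "b" is fetched: 24 payload bytes instead of 72.
    print(read_columns(path, ["b"]))
```

In this sketch the cost of a query scales with the size of the selected columns plus the footer, not with the size of the file, which is the behaviour the first approach is after. Whether the real reader can do the same depends on how it locates the individual column buffers within each record batch, which the toy layout sidesteps.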
This is related to ARROW-8250 but that issue seems more focused on row
filtering while this issue is for column filtering.