[ 
https://issues.apache.org/jira/browse/ARROW-14503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace updated ARROW-14503:
--------------------------------
    Description: 
The datasets API uses the RecordBatchFileReader to read Feather files.  This 
reader will always "read" the entire file.  If the file is memory mapped, this 
might not be a true read.  However, the datasets API never uses memory-mapped 
files.

This large read from RAM (or worse, disk) becomes a bottleneck for simple 
queries that load only a few columns from the dataset.

The fix may be to modify the reader to seek out and pluck only the needed 
data.  Or the fix may be to modify the datasets API to use memory-mapped files 
when possible (although the former approach seems more generally applicable).
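
To illustrate the first option, here is a minimal sketch of "seek out and 
pluck" on a toy columnar file.  This is NOT the Arrow IPC layout and uses no 
Arrow APIs; the format and the helper names write_columns and read_columns are 
hypothetical and exist only for this example.  Each column is stored as one 
contiguous block and a footer records its offset and length, so a reader can 
fetch only the columns a query needs.

```python
import json
import os
import struct
import tempfile


def write_columns(path, columns):
    """Write each int64 column as one contiguous block, followed by a JSON
    footer mapping column name -> [offset, length] and a 4-byte footer size."""
    index = {}
    with open(path, "wb") as f:
        for name, values in columns.items():
            payload = struct.pack(f"<{len(values)}q", *values)
            index[name] = (f.tell(), len(payload))
            f.write(payload)
        footer = json.dumps(index).encode("utf-8")
        f.write(footer)
        f.write(struct.pack("<I", len(footer)))


def read_columns(path, wanted):
    """Read only the footer and the requested column blocks.

    Returns (data, bytes_read) so the caller can compare the bytes touched
    against the total file size."""
    bytes_read = 0
    with open(path, "rb") as f:
        # The last 4 bytes hold the footer size.
        f.seek(-4, os.SEEK_END)
        (footer_len,) = struct.unpack("<I", f.read(4))
        # The footer sits immediately before the size field.
        f.seek(-(4 + footer_len), os.SEEK_END)
        index = json.loads(f.read(footer_len))
        bytes_read += 4 + footer_len

        data = {}
        for name in wanted:
            offset, length = index[name]
            f.seek(offset)  # jump straight to the column block
            raw = f.read(length)
            bytes_read += length
            data[name] = list(struct.unpack(f"<{length // 8}q", raw))
    return data, bytes_read


# Three columns on disk, but the "query" projects only column "b".
path = os.path.join(tempfile.mkdtemp(), "toy.bin")
write_columns(path, {
    "a": list(range(1000)),
    "b": list(range(1000, 2000)),
    "c": list(range(2000, 3000)),
})
data, bytes_read = read_columns(path, ["b"])
print(f"read {bytes_read} of {os.path.getsize(path)} bytes")
```

With three equally sized columns, the projected read touches roughly a third 
of the file plus the footer.  A real fix would have to work from the metadata 
in the IPC file footer and in each record batch message, which this sketch 
does not attempt to model.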

This is related to ARROW-8250 but that issue seems more focused on row 
filtering while this issue is for column filtering.



  was:
The datasets API uses the RecordBatchFileReader to read feather files.  This 
reader will always "read" the entire file.  If the file is memory mapped this 
might not be a true read.  However, the datasets API never uses memory mapped 
files.

This large read from RAM (or worse, disk) becomes a bottleneck for simple 
queries that load only a few columns from the dataset.

The fix may be to modify the reader to use a seek out and pluck only the needed 
data.  Or the fix may be to modify the datasets API to use memory mapped files 
when possible (although the former approach seems more generally applicable).

This is related to ARROW-8250 but that issue seems more focused on row 
filtering while this issue is for column filtering.




> [C++][Dataset] Projection pushdown in IPC (feather) format
> ----------------------------------------------------------
>
>                 Key: ARROW-14503
>                 URL: https://issues.apache.org/jira/browse/ARROW-14503
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Weston Pace
>            Priority: Major
>
> The datasets API uses the RecordBatchFileReader to read Feather files.  This 
> reader will always "read" the entire file.  If the file is memory mapped, this 
> might not be a true read.  However, the datasets API never uses memory-mapped 
> files.
> This large read from RAM (or worse, disk) becomes a bottleneck for simple 
> queries that load only a few columns from the dataset.
> The fix may be to modify the reader to seek out and pluck only the needed 
> data.  Or the fix may be to modify the datasets API to use memory mapped 
> files when possible (although the former approach seems more generally 
> applicable).
> This is related to ARROW-8250 but that issue seems more focused on row 
> filtering while this issue is for column filtering.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
