[
https://issues.apache.org/jira/browse/ARROW-13797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joris Van den Bossche updated ARROW-13797:
------------------------------------------
Description:
ARROW-13572 (https://github.com/apache/arrow/pull/10991) added basic support
for ORC file format in the Datasets API, but the reader still reads all columns
regardless of the ScanOptions. Since ORC is a columnar format that supports
reading only specific fields, we can optimize this step.
The tricky part is to convert the field name of the Arrow schema to the index
in the ORC schema. Currently, this logic is included in the Python bindings
(https://github.com/apache/arrow/blob/5ca62b910d2de4e705560bef28259b966c7b0dcf/python/pyarrow/orc.py#L36-L59),
so this needs to be moved to C++.
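
To illustrate the name-to-index conversion described above, here is a minimal sketch in plain Python. It is not the code from pyarrow/orc.py or a proposed C++ implementation. It assumes that ORC numbers columns by a pre-order traversal of the type tree, with the root struct as id 0. The schema representation (dicts for structs, one-element lists for list types, strings for primitives) and the function names `orc_column_indices` and `projected_indices` are hypothetical and exist only for this example. Map and union types are not handled.

```python
from itertools import count


def orc_column_indices(schema):
    """Map dotted field paths to assumed ORC column ids.

    `schema` is a hypothetical stand-in for an Arrow schema:
    dict -> struct (field name -> type), list -> list type with
    one element type, str -> primitive type.
    """
    counter = count(1)  # id 0 is assumed to be the root struct
    mapping = {}

    def visit(typ, path):
        if isinstance(typ, dict):
            for name, child in typ.items():
                child_path = path + (name,)
                # Each struct field takes the next pre-order id.
                mapping[".".join(child_path)] = next(counter)
                visit(child, child_path)
        elif isinstance(typ, list):
            # The unnamed list element takes an id of its own,
            # but it cannot be selected by name.
            next(counter)
            visit(typ[0], path)
        # Primitive types have no children to number.

    visit(schema, ())
    return mapping


def projected_indices(schema, names):
    """Return sorted column ids for the requested field paths."""
    mapping = orc_column_indices(schema)
    return sorted(mapping[name] for name in names)


example_schema = {
    "a": "int64",
    "b": {"x": "string", "y": "double"},
    "c": ["int64"],
    "d": "bool",
}
indices = orc_column_indices(example_schema)
```

In a sketch like this, a scan that projects only `a` and `b.y` would call `projected_indices(example_schema, ["a", "b.y"])` and pass the resulting ids to the reader instead of reading every column.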
> [C++] Implement column projection pushdown to ORC reader in Datasets API
> ------------------------------------------------------------------------
>
> Key: ARROW-13797
> URL: https://issues.apache.org/jira/browse/ARROW-13797
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Joris Van den Bossche
> Priority: Major
>
> ARROW-13572 (https://github.com/apache/arrow/pull/10991) added basic support
> for ORC file format in the Datasets API, but the reader still reads all
> columns regardless of the ScanOptions. Since ORC is a columnar format that
> supports reading only specific fields, we can optimize this step.
> The tricky part is to convert the field name of the Arrow schema to the index
> in the ORC schema. Currently, this logic is included in the Python bindings
> (https://github.com/apache/arrow/blob/5ca62b910d2de4e705560bef28259b966c7b0dcf/python/pyarrow/orc.py#L36-L59),
> so this needs to be moved to C++.