[
https://issues.apache.org/jira/browse/ARROW-13797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joris Van den Bossche updated ARROW-13797:
------------------------------------------
Labels: dataset orc (was: )
> [C++] Implement column projection pushdown to ORC reader in Datasets API
> ------------------------------------------------------------------------
>
> Key: ARROW-13797
> URL: https://issues.apache.org/jira/browse/ARROW-13797
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Joris Van den Bossche
> Priority: Major
> Labels: dataset, orc
>
> ARROW-13572 (https://github.com/apache/arrow/pull/10991) added basic support
> for the ORC file format in the Datasets API, but the reader still reads all
> columns regardless of the ScanOptions. Since ORC is a columnar format that
> supports reading only specific fields, we can push the column projection down
> to the ORC reader so that only the requested columns are read.
> The tricky part is converting the field names of the Arrow schema to the
> corresponding column indices in the ORC schema. Currently, this logic lives
> in the Python bindings
> (https://github.com/apache/arrow/blob/5ca62b910d2de4e705560bef28259b966c7b0dcf/python/pyarrow/orc.py#L36-L59),
> so it needs to be moved to C++.
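
To make the name-to-index conversion concrete, here is a minimal sketch of the idea in plain Python. It is an illustration under assumptions, not the code from the Python bindings or a proposed C++ design: the schema is modelled with a hypothetical toy representation (dicts for structs, one-element lists for lists, strings for primitives), and the helper names `schema_to_indices` and `projection_to_indices` are made up for this example. It relies on ORC assigning column ids by a pre-order walk of the type tree, with the root struct taking id 0.

```python
from itertools import count

# Hypothetical toy schema model (NOT an Arrow or ORC API):
#   primitive -> a string such as "int64"
#   struct    -> a dict mapping field name to child type
#   list      -> a one-element Python list holding the element type


def _traverse(typ, counter, prefix=()):
    """Yield (path, column_id) pairs in pre-order."""
    if isinstance(typ, dict):
        for name, child in typ.items():
            path = prefix + (name,)
            # Each struct field is its own node in the ORC type tree.
            yield path, next(counter)
            yield from _traverse(child, counter, path)
    elif isinstance(typ, list):
        # The list element is a separate node with no name of its own:
        # consume its id without exposing it, then descend into it.
        next(counter)
        yield from _traverse(typ[0], counter, prefix)
    # Primitive types have no children, so nothing is yielded.


def schema_to_indices(schema):
    """Map dotted field paths to column ids (root struct is id 0)."""
    return {".".join(path): idx for path, idx in _traverse(schema, count(1))}


def projection_to_indices(schema, names):
    """Translate requested field names into sorted column ids."""
    mapping = schema_to_indices(schema)
    missing = [name for name in names if name not in mapping]
    if missing:
        raise KeyError(f"fields not in schema: {missing}")
    return sorted(mapping[name] for name in names)


example_schema = {
    "a": "int64",
    "b": {"c": "int64", "d": "string"},
    "e": ["int64"],
    "f": "string",
}
print(schema_to_indices(example_schema))
print(projection_to_indices(example_schema, ["b.d", "a"]))  # [1, 4]
```

Note that "f" maps to 7 rather than 6 in this example, because the element of list "e" occupies id 6 even though it cannot be selected by name. A C++ version would need the same bookkeeping for nested types, plus whatever handling is chosen for maps and unions, which this sketch leaves out.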