[ 
https://issues.apache.org/jira/browse/ARROW-13797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-13797:
------------------------------------------
        Parent: ARROW-13233
    Issue Type: Sub-task  (was: Improvement)

> [C++] Implement column projection pushdown to ORC reader in Datasets API
> ------------------------------------------------------------------------
>
>                 Key: ARROW-13797
>                 URL: https://issues.apache.org/jira/browse/ARROW-13797
>             Project: Apache Arrow
>          Issue Type: Sub-task
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: dataset, orc
>
> ARROW-13572 (https://github.com/apache/arrow/pull/10991) added basic support
> for the ORC file format in the Datasets API, but the reader still reads all
> columns regardless of the ScanOptions. Since ORC is a columnar format that
> supports reading only specific fields, we can push the column projection down
> to the ORC reader so that only the requested columns are read.
> The tricky part is mapping each field name in the Arrow schema to its column
> index in the ORC schema. Currently, this logic lives in the Python bindings
> (https://github.com/apache/arrow/blob/5ca62b910d2de4e705560bef28259b966c7b0dcf/python/pyarrow/orc.py#L36-L59),
> but it needs to be moved to C++.
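
A minimal sketch of the name-to-index mapping described above, written in Python for brevity. It assumes ORC's documented numbering, where column ids are assigned by a pre-order traversal of the type tree and the root struct gets id 0. The `Node` class and the `column_ids` and `project` helpers are hypothetical stand-ins for illustration. They are not the pyarrow code linked above and not an Arrow or ORC API, and special cases such as maps and unions are left out.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Dict, List, Tuple


@dataclass
class Node:
    """Hypothetical stand-in for a schema type: a kind plus named children."""
    kind: str  # e.g. "struct", "list", "int64", "string"
    children: List[Tuple[str, "Node"]] = field(default_factory=list)


def column_ids(root: Node) -> Dict[Tuple[str, ...], int]:
    """Map every field path to its pre-order column id (root struct is 0)."""
    counter = count(0)
    ids: Dict[Tuple[str, ...], int] = {}

    def visit(node: Node, path: Tuple[str, ...]) -> None:
        # Each type node consumes one id before its children are visited.
        ids[path] = next(counter)
        for name, child in node.children:
            visit(child, path + (name,))

    visit(root, ())
    return ids


def project(root: Node, names: List[str]) -> List[int]:
    """Translate top-level field names into column ids, in request order."""
    ids = column_ids(root)
    missing = [name for name in names if (name,) not in ids]
    if missing:
        raise KeyError(f"fields not found in schema: {missing}")
    return [ids[(name,)] for name in names]


# struct<a:int64, b:struct<x:int64, y:string>, c:list<item:double>, d:string>
schema = Node("struct", [
    ("a", Node("int64")),
    ("b", Node("struct", [("x", Node("int64")), ("y", Node("string"))])),
    ("c", Node("list", [("item", Node("double"))])),
    ("d", Node("string")),
])

selected = project(schema, ["a", "d"])  # ids for the two requested fields
```

Nested types are what make the mapping non-trivial: in this schema the field "d" is the fourth top-level field but has column id 7, because the children of "b" and "c" consume ids before it.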



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
