[jira] [Updated] (ARROW-13797) [C++] Implement column projection pushdown to ORC reader in Datasets API

Joris Van den Bossche (Jira) Mon, 30 Aug 2021 09:14:06 -0700


     [ 
https://issues.apache.org/jira/browse/ARROW-13797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Joris Van den Bossche updated ARROW-13797:
------------------------------------------
    Description: 
ARROW-13572 (https://github.com/apache/arrow/pull/10991) added basic support 
for the ORC file format in the Datasets API, but the reader still reads all 
columns regardless of the projection requested in the ScanOptions. Since ORC 
is a columnar format that supports reading only specific fields, we can push 
the column projection down to the ORC reader and read only the requested 
columns. 

The tricky part is converting the field names of the Arrow schema to the 
corresponding column indices in the ORC schema. Currently, this logic lives 
in the Python bindings 
(https://github.com/apache/arrow/blob/5ca62b910d2de4e705560bef28259b966c7b0dcf/python/pyarrow/orc.py#L36-L59),
so it needs to be moved to C++.
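
To illustrate the name-to-index conversion described above, here is a minimal 
sketch in Python. It is not the code from the Python bindings and does not use 
the Arrow or ORC APIs: the type model (strings for primitives, dicts for 
structs, one-element lists for list types) and the helper names 
field_name_to_orc_index and columns_to_indices are hypothetical. It assumes 
ORC's convention of numbering columns in depth-first pre-order, with the root 
struct as column 0.

```python
from itertools import count

# Hypothetical, simplified type model (NOT the Arrow or ORC API):
#   - a primitive type is a string such as "int64"
#   - a struct type is a dict mapping field name -> type
#   - a list type is a one-element Python list: [element_type]


def _traverse(typ, counter, prefix=()):
    """Yield (path, column_id) pairs in depth-first pre-order."""
    if isinstance(typ, dict):
        for name, child in typ.items():
            path = prefix + (name,)
            # The field itself takes the next column id ...
            yield path, next(counter)
            # ... followed by everything nested below it.
            yield from _traverse(child, counter, path)
    elif isinstance(typ, list):
        # The list element occupies a column id of its own, but it has no
        # name and cannot be selected directly, so the id is only skipped.
        next(counter)
        yield from _traverse(typ[0], counter, prefix)
    # Primitive types have no children: nothing more to yield.


def field_name_to_orc_index(schema):
    """Map dotted field names to column ids (the root struct is id 0)."""
    return {".".join(path): idx for path, idx in _traverse(schema, count(1))}


def columns_to_indices(schema, columns):
    """Translate the requested field names into column ids."""
    lookup = field_name_to_orc_index(schema)
    missing = [name for name in columns if name not in lookup]
    if missing:
        raise KeyError(f"fields not found in schema: {missing}")
    return [lookup[name] for name in columns]


schema = {
    "id": "int64",
    "point": {"x": "double", "y": "double"},
    "tags": ["string"],
    "name": "string",
}

print(field_name_to_orc_index(schema))
# {'id': 1, 'point': 2, 'point.x': 3, 'point.y': 4, 'tags': 5, 'name': 7}
print(columns_to_indices(schema, ["name", "point.x"]))
# [7, 3]
```

Note that "name" maps to 7 rather than 6 because the element of the "tags" 
list consumes an index of its own. Offsets of this kind are why the lookup has 
to walk the whole schema rather than use the top-level field position.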

> [C++] Implement column projection pushdown to ORC reader in Datasets API
> ------------------------------------------------------------------------
>
>                 Key: ARROW-13797
>                 URL: https://issues.apache.org/jira/browse/ARROW-13797
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Major
>
> ARROW-13572 (https://github.com/apache/arrow/pull/10991) added basic support 
> for the ORC file format in the Datasets API, but the reader still reads all 
> columns regardless of the projection requested in the ScanOptions. Since ORC 
> is a columnar format that supports reading only specific fields, we can push 
> the column projection down to the ORC reader and read only the requested 
> columns. 
> The tricky part is converting the field names of the Arrow schema to the 
> corresponding column indices in the ORC schema. Currently, this logic lives 
> in the Python bindings 
> (https://github.com/apache/arrow/blob/5ca62b910d2de4e705560bef28259b966c7b0dcf/python/pyarrow/orc.py#L36-L59),
> so it needs to be moved to C++.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
