[ https://issues.apache.org/jira/browse/ARROW-15726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Weston Pace resolved ARROW-15726.
---------------------------------
Fix Version/s: 8.0.0
Resolution: Fixed
Issue resolved by pull request 12466
[https://github.com/apache/arrow/pull/12466]
> [C++] If a projected_schema is not supplied but a bound projection expression is, then we should use it to infer the projected_schema
> -------------------------------------------------------------------------------------------------------------------------------------
>
> Key: ARROW-15726
> URL: https://issues.apache.org/jira/browse/ARROW-15726
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Weston Pace
> Assignee: Weston Pace
> Priority: Major
> Labels: pull-request-available
> Fix For: 8.0.0
>
> Time Spent: 50m
> Remaining Estimate: 0h
>
> The following query should read a single column from the target Parquet file.
> {noformat}
> open_dataset("lineitem.parquet") %>%
>   select(l_tax) %>%
>   filter(l_tax < 0.01) %>%
>   collect()
> {noformat}
> Furthermore, it should push the filter down to the source node so that
> Parquet row group statistics can be used to skip row groups that cannot
> match.
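> For comparison, here is a minimal sketch of the intended behavior expressed
> directly against the C++ dataset scanner API rather than through an exec plan
> (the ScanLTax helper and the already-opened dataset argument are illustrative
> assumptions, not part of this issue): only l_tax is materialized, and the
> filter is handed to the scan so the Parquet reader can consult row group
> statistics.
> {noformat}
> #include <memory>
>
> #include <arrow/api.h>
> #include <arrow/compute/expression.h>
> #include <arrow/dataset/api.h>
>
> namespace cp = arrow::compute;
> namespace ds = arrow::dataset;
>
> // Scan only l_tax, with the filter pushed down so the Parquet format can
> // prune row groups using statistics.
> arrow::Result<std::shared_ptr<arrow::Table>> ScanLTax(
>     std::shared_ptr<ds::Dataset> dataset) {
>   ARROW_ASSIGN_OR_RAISE(auto builder, dataset->NewScan());
>   ARROW_RETURN_NOT_OK(builder->Project({"l_tax"}));
>   ARROW_RETURN_NOT_OK(
>       builder->Filter(cp::less(cp::field_ref("l_tax"), cp::literal(0.01))));
>   ARROW_ASSIGN_OR_RAISE(auto scanner, builder->Finish());
>   return scanner->ToTable();
> }
> {noformat}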
> At the moment it creates the following exec plan:
> {noformat}
> 3:SinkNode{}
> 2:ProjectNode{projection=[l_tax]}
> 1:FilterNode{filter=(l_tax < 0.01)}
> 0:SourceNode{}
> {noformat}
> There is no projection or filter in the source node. As a result, we read far
> more data from disk (the entire file) than we need to (at most a single
> column).
> This _could_ be fixed via heuristics in the dplyr code. However, it quickly
> gets complex. For example, the projection comes after the filter, so the
> pushed-down projection has to include both the columns accessed by the filter
> and the columns accessed by the projection. And while a projection can be
> pushed down through a join, how do you know which columns to apply to which
> source node?
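> To make the column bookkeeping concrete: at minimum the heuristic needs the
> union of the fields referenced by the filter and by the projection, along the
> lines of the sketch below (ColumnsToPushDown is an illustrative name;
> FieldsInExpression is an existing helper in arrow/compute/expression.h).
> {noformat}
> #include <string>
> #include <unordered_set>
> #include <vector>
>
> #include <arrow/api.h>
> #include <arrow/compute/expression.h>
>
> namespace cp = arrow::compute;
>
> // The projection pushed into the scan must cover every column referenced by
> // either the filter expression or the projection expression.
> std::vector<arrow::FieldRef> ColumnsToPushDown(const cp::Expression& filter,
>                                                const cp::Expression& projection) {
>   std::vector<arrow::FieldRef> refs = cp::FieldsInExpression(filter);
>   for (const arrow::FieldRef& ref : cp::FieldsInExpression(projection)) {
>     refs.push_back(ref);
>   }
>   // Deduplicate; FieldRef::ToString gives a stable key for named references.
>   std::unordered_set<std::string> seen;
>   std::vector<arrow::FieldRef> unique;
>   for (const arrow::FieldRef& ref : refs) {
>     if (seen.insert(ref.ToString()).second) unique.push_back(ref);
>   }
>   return unique;
> }
> {noformat}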
> A more complete fix would be to call into some kind of third-party optimizer
> (e.g. Calcite) after the plan has been created by dplyr but before it is
> passed to the execution engine.
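> The change tracked by this issue takes a narrower route: when a scan is given
> a bound projection expression but no projected_schema, infer the schema from
> the projection itself. In the scanner the projection is typically a
> make_struct call, so once bound its output type lists exactly the projected
> columns. A hedged sketch of the idea (InferProjectedSchema is an illustrative
> name, not the actual patch):
> {noformat}
> #include <memory>
>
> #include <arrow/api.h>
> #include <arrow/compute/expression.h>
>
> namespace cp = arrow::compute;
>
> // Sketch only: derive a projected schema from a bound, struct-typed
> // projection expression when the caller did not supply one.
> arrow::Result<std::shared_ptr<arrow::Schema>> InferProjectedSchema(
>     const cp::Expression& bound_projection) {
>   if (!bound_projection.IsBound()) {
>     return arrow::Status::Invalid("projection expression is not bound");
>   }
>   const auto& type = bound_projection.type();
>   if (type->id() != arrow::Type::STRUCT) {
>     return arrow::Status::Invalid("expected a struct-typed projection");
>   }
>   return arrow::schema(type->fields());
> }
> {noformat}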
--
This message was sent by Atlassian Jira
(v8.20.1#820001)