Weston Pace created ARROW-15726:
-----------------------------------

             Summary: [R] Support push-down projection/filtering in datasets / dplyr
                 Key: ARROW-15726
                 URL: https://issues.apache.org/jira/browse/ARROW-15726
             Project: Apache Arrow
          Issue Type: Improvement
          Components: R
            Reporter: Weston Pace


The following query should read a single column from the target parquet file.

{noformat}
open_dataset("lineitem.parquet") %>% select(l_tax) %>% filter(l_tax < 0.01) %>% 
collect()
{noformat}

Furthermore, it should push the filter down to the source node so that parquet 
row-group statistics can be used to skip row groups that cannot contain any 
matching rows.
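
To illustrate what a pushed-down filter buys at the row-group level, here is a 
minimal base-R sketch.  It does not use the arrow package; the statistics 
table, its values, and the helper name may_match are all made up for 
illustration.  It only shows the pruning rule for a predicate of the form 
l_tax < threshold.

```r
# Hypothetical per-row-group min/max statistics for l_tax (values are invented).
row_groups <- data.frame(
  id      = 1:4,
  min_tax = c(0.00, 0.02, 0.01, 0.00),
  max_tax = c(0.03, 0.08, 0.05, 0.005)
)

# For the predicate `l_tax < threshold`, a row group can contain a matching
# row only if its minimum value is below the threshold.
may_match <- function(stats, threshold) {
  stats$min_tax < threshold
}

# Row groups that still have to be read; the others can be skipped entirely.
keep <- row_groups[may_match(row_groups, 0.01), ]
print(keep$id)
```

With these invented statistics only row groups 1 and 4 would need to be read; 
the other two cannot contain a row with l_tax < 0.01.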

At the moment it creates the following exec plan:

{noformat}
3:SinkNode{}
  2:ProjectNode{projection=[l_tax]}
    1:FilterNode{filter=(l_tax < 0.01)}
      0:SourceNode{}
{noformat}

There is no projection or filter in the source node.  As a result we end up 
reading much more data from disk (the entire file) than we need to (at most a 
single column).

This _could_ be fixed via heuristics in the dplyr code.  However, it may 
quickly get complex.  For example, the projection comes after the filter, so 
the pushed-down projection must include both the columns accessed by the 
filter and the columns accessed by the projection.  Joins raise further 
questions: can you push the projection down through a join (yes, you can), and 
how do you know which columns to apply to which source node?
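
As a rough sketch of what such a heuristic would have to compute, the base-R 
code below works on a toy, linear plan represented as a plain list.  The plan 
representation and the functions push_down and split_by_source are 
hypothetical and are not part of the arrow package or of the exec plan API.  
The join example assumes a second table, orders, and column names chosen only 
for illustration.

```r
# Toy plan for the query in this issue, listed from source to sink.
# `columns` records the fields that each node's expressions refer to.
plan <- list(
  list(type = "source",  columns = character(0)),
  list(type = "filter",  columns = "l_tax", expr = "l_tax < 0.01"),
  list(type = "project", columns = "l_tax"),
  list(type = "sink",    columns = character(0))
)

# Copy the projection and filter into the source node.  Taking the union of
# every downstream column is a safe over-approximation for this toy plan.
push_down <- function(plan) {
  needed  <- unique(unlist(lapply(plan[-1], function(node) node$columns)))
  filters <- unlist(lapply(plan, function(node) node$expr))
  plan[[1]]$columns <- needed
  plan[[1]]$filter  <- filters
  plan
}

# For a join, route each required column to the source whose schema has it.
split_by_source <- function(needed, schemas) {
  lapply(schemas, function(fields) intersect(needed, fields))
}

optimized <- push_down(plan)
print(optimized[[1]])

# Illustrative schemas for a two-table join (assumed, not from the issue).
schemas <- list(
  lineitem = c("l_orderkey", "l_tax", "l_comment"),
  orders   = c("o_orderkey", "o_totalprice")
)
routed <- split_by_source(c("l_tax", "l_orderkey", "o_orderkey"), schemas)
print(routed)
```

Even in this simplified form the rewrite needs to know which fields every 
expression touches and which source owns each field, which is the bookkeeping 
that grows quickly once joins and computed columns are involved.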

A more complete fix would be to call into some kind of third-party optimizer 
(e.g. Apache Calcite) after the plan has been created by dplyr but before it 
is passed to the execution engine.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)