Weston Pace created ARROW-15726:
-----------------------------------
Summary: [R] Support push-down projection/filtering in datasets /
dplyr
Key: ARROW-15726
URL: https://issues.apache.org/jira/browse/ARROW-15726
Project: Apache Arrow
Issue Type: Improvement
Components: R
Reporter: Weston Pace
The following query should read a single column from the target parquet file.
{noformat}
open_dataset("lineitem.parquet") %>% select(l_tax) %>% filter(l_tax < 0.01) %>%
collect()
{noformat}
Furthermore, it should apply a pushdown filter to the source node allowing
parquet row groups to potentially filter out target data.
At the moment it creates the following exec plan:
{noformat}
3:SinkNode{}
2:ProjectNode{projection=[l_tax]}
1:FilterNode{filter=(l_tax < 0.01)}
0:SourceNode{}
{noformat}
There is no projection or filter in the source node. As a result we end up
reading much more data from disk (the entire file) than we need to (at most a
single column).
This _could_ be fixed via heuristics in the dplyr code. However, it may
quickly get complex (for example, the project comes after the filter, so you
need to make sure you push down a projection that includes both the columns
accessed by the filter and the columns accessed by the projection OR can you
push down the projection through a join [yes you can], how do you know which
columns to apply to which source node).
A more complete fix would be to call into some kind of 3rd party optimizer
(e.g. calcite) after the plan has been created by dplyr but before it is passed
to the execution engine.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)