[ https://issues.apache.org/jira/browse/ARROW-15726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17494767#comment-17494767 ]
Jonathan Keane commented on ARROW-15726:
----------------------------------------
This might be a coincidence (though I suspect not...). Our dataset benchmarks
are suddenly failing (with at least some of the failures caused by attempting
to read data that shouldn't be read at all [1]).
Elena has helped narrow down the range of possible commits that this could have
happened in:
The earliest commit in which this might have happened is
https://github.com/apache/arrow/commit/a935c81b595d24179e115d64cda944efa93aa0e0
and it definitely happens at
https://github.com/apache/arrow/commit/afaa92e7e4289d6e4f302cc91810368794e8092b
so the regression landed at that second commit or somewhere in between.
Here's an example buildkite log:
https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/145#ebd7ea7a-3fad-4865-9e73-49200d89ddd6/6-3230
[1] This is a bit in the weeds, so bear with me: the dataset we use for these
benchmarks includes data in an early year that causes a schema failure
({{Unsupported cast from string to null using function cast_null}}). The
benchmarks we wrote deliberately avoid selecting anything from that first year,
so if filter pushdown is working we never hit the error. This _has_ been
working (for a while now! even with exec nodes), but about three days ago it
stopped working and the benchmarks started failing.
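To make the mechanism concrete, here is a minimal sketch of the kind of query
involved (the dataset path, partitioning column, and cutoff year are
hypothetical stand-ins, not the actual benchmark code):
{noformat}
library(arrow)
library(dplyr)

# Hypothetical dataset partitioned by year; files from the earliest year
# contain an all-null column that fails to cast during schema unification.
ds <- open_dataset("path/to/benchmark-dataset")

# The filter excludes the problematic first year. When filter pushdown is
# working, the scanner never opens those files, so the cast error is never
# triggered; when pushdown is broken, every file is read and the query fails.
ds %>%
  filter(year > 1988) %>%
  select(total_amount) %>%
  collect()
{noformat}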
> [R] Support push-down projection/filtering in datasets / dplyr
> --------------------------------------------------------------
>
> Key: ARROW-15726
> URL: https://issues.apache.org/jira/browse/ARROW-15726
> Project: Apache Arrow
> Issue Type: Improvement
> Components: R
> Reporter: Weston Pace
> Priority: Major
>
> The following query should read a single column from the target parquet file.
> {noformat}
> open_dataset("lineitem.parquet") %>%
>   select(l_tax) %>%
>   filter(l_tax < 0.01) %>%
>   collect()
> {noformat}
> Furthermore, it should apply a pushdown filter to the source node, allowing
> whole parquet row groups to potentially be skipped based on their statistics.
> At the moment it creates the following exec plan:
> {noformat}
> 3:SinkNode{}
> 2:ProjectNode{projection=[l_tax]}
> 1:FilterNode{filter=(l_tax < 0.01)}
> 0:SourceNode{}
> {noformat}
> There is no projection or filter in the source node. As a result we end up
> reading much more data from disk (the entire file) than we need to (at most a
> single column).
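> For contrast, a plan with pushdown applied might look roughly like the
> following (illustrative only: the syntax for pushed-down options on the
> source node is an assumption, and the FilterNode is kept because
> row-group-level filtering is best-effort, so the exact predicate still has
> to be applied afterwards):
> {noformat}
> 3:SinkNode{}
> 2:ProjectNode{projection=[l_tax]}
> 1:FilterNode{filter=(l_tax < 0.01)}
> 0:SourceNode{filter=(l_tax < 0.01), projection=[l_tax]}
> {noformat}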
> This _could_ be fixed via heuristics in the dplyr code. However, it may
> quickly get complex. For example, the project comes after the filter, so
> you need to make sure you push down a projection that includes both the
> columns accessed by the filter and the columns accessed by the projection.
> Or: can you push a projection down through a join (yes you can), and how do
> you know which columns to apply to which source node?
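> As a rough illustration of the column bookkeeping such a heuristic needs
> (plain base R; the expressions come from the query above, and none of this
> is actual dplyr internals):
> {noformat}
> # The projection pushed down to the scan must cover both the columns the
> # final projection keeps and the columns the filter expression reads.
> filter_expr <- quote(l_tax < 0.01)
> selected_cols <- c("l_tax")
> 
> filter_cols <- all.vars(filter_expr)             # extracts "l_tax"
> pushdown_cols <- union(selected_cols, filter_cols)
> pushdown_cols                                    # columns to scan: "l_tax"
> {noformat}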
> A more complete fix would be to call into some kind of third-party optimizer
> (e.g. Calcite) after the plan has been created by dplyr but before it is
> passed to the execution engine.