Holden Karau created SPARK-47672:
------------------------------------
Summary: Avoid double evaluation of non-trivial projected elements
from filter pushdown
Key: SPARK-47672
URL: https://issues.apache.org/jira/browse/SPARK-47672
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.5.1
Reporter: Holden Karau
Repro here [https://gist.github.com/holdenk/0f9660bcbd9e63aaff904f15d3439db1]
You can work around this by setting an expensive UDF to non-deterministic but
that's not ideal and won't fix expensive internal operations (like string
matching).
Instead when we go to bubble up a filter, if we should not move a filter up
above a projection of what we are filtering on.
https://issues.apache.org/jira/browse/SPARK-40045 partially fixed some of this
by (roughly) ordering filter expressions by cost so that we're not evaluating
more than ~2x (e.g. in old behavior bubbled up filter could become the first
elem of the filter and then the cheap null checks would go away and we'd have
expensive compute on everything not just filtered data), but we should "trust"
the users projection + later use of that projection to indicate that a UDF is
expensive and we should only evaluate it once inside of the projection and
filter after.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]