Lovasoa created SPARK-20939:
-------------------------------
Summary: Do not duplicate user-defined functions while optimizing
logical query plans
Key: SPARK-20939
URL: https://issues.apache.org/jira/browse/SPARK-20939
Project: Spark
Issue Type: Bug
Components: Optimizer, SQL
Affects Versions: 2.1.0
Reporter: Lovasoa
Priority: Minor
Currently, while optimizing a query plan, spark pushes filters down the query
plan tree, so that
Filter UDF(a)
+- Join Inner, (a = b)
+- Relation
+- Relation
becomes
Join Inner, (a = b)
+- Filter UDF(a)
+- Relation
+- Filter UDF(b)
+- Relation
In general, it is a good thing to push down filters as it reduces the number of
records that will go through the join.
However, in the case where the filter is an user-defined function (UDF), we
cannot know if the cost of executing the function twice will be higher than the
eventual cost of joining more elements or not.
So I think that the optimizer shouldn't move the user-defined function in the
query plan tree. The user will still be able to duplicate the function if he
wants to.
See this question on stackoverflow:
https://stackoverflow.com/questions/44291078/how-to-tune-the-query-planner-and-turn-off-an-optimization-in-spark
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]