[GitHub] [spark] skambha commented on issue #24593: [SPARK-27692][SQL] Add new optimizer rule to evaluate the deterministic scala udf only once if all inputs are literals

GitBox Tue, 14 May 2019 16:36:26 -0700

skambha commented on issue #24593: [SPARK-27692][SQL] Add new optimizer rule to 
evaluate the deterministic scala udf only once if all inputs are literals
URL: https://github.com/apache/spark/pull/24593#issuecomment-492448836
 
 
   Thanks everyone for your comments.  (@BryanCutler , @JoshRosen , @viirya, 
@gatorsmile , @mgaido91 , @dilipbiswal )  
   
   I wanted to summarize my understanding of the discussion so far and my 
comments. 
   
   In general, the optimization mentioned here to optimize the deterministic 
UDF for literals makes sense. 
   
   **Optimization:** 
   There are two ways to do the optimization that has been discussed: 
   1. New Optimization Rule  ‘DeterministicLiteralUDF’.    ie Changes are in 
the PR.   This can be disabled using the optimizer property.  
spark.sql.optimizer.excludedRules
   2. Implement ScalaUDF foldable  (As mentioned by JoshRosen) that will be 
consumed by the existing ConstantFolding optimizer rule.   We can also use a 
config property to enable or disable it.    
   
   Having this as a separate optimizer rule enables user to use the existing 
framework's property spark.sql.optimizer.excludedRules to exclude it.  
   
   **Concerns:** 
   1. Today, in Spark the UDFs by **default** are deterministic.   The users 
may not be aware of this as it is implicitly set by default. 
   2. This optimization is based on the ‘deterministic’ property of the UDF and 
the inputs are literals.  Users may not be aware of the deterministic property 
since it is set by default to true.   We do not want to do optimization because 
we are not sure if user fully understands the implication of the udf being 
deterministic
   
   **To address the concerns:** 
   - For  1: I can open a new issue to discuss whether the udf should be marked 
nondeterministic by default and user consciously marks it deterministic. 
   
   - For 2:  For this optimization based on deterministic property,  we create 
a new rule & have it under a config flag and by default make it **not** use 
this optimization.   This would address the concern 2.  
   
   
   **Documentation:** 
   In general, there is agreement that we should document in the migration 
guide to ensure that users are aware of this optimization and the implications 
behind it. 
   
   **Questions:** 
   Q1: Can we move forward with this PR under a flag that will disable the 
optimization by default?
   Q2: Do we want to hold off on this optimization till UDFs are marked 
nondeterministic by default? 
   ———
   
   Please let me know. Thanks.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] skambha commented on issue #24593: [SPARK-27692][SQL] Add new optimizer rule to evaluate the deterministic scala udf only once if all inputs are literals

Reply via email to