GitHub user dongjoon-hyun commented on the pull request:

    https://github.com/apache/spark/pull/13087#issuecomment-220721052
  
    For the documentation, sure, I will. From now on, we should declare 
explicitly that UDFs must be deterministic, as @thunterdb and you said.
    
    For the runtime, I don't have a benchmark for this situation. @linbojin, 
the reporter of this issue, might have some. @linbojin, could you answer 
@marmbrus's question?
    
    Generally speaking, this PR will prevent early pruning (predicate 
pushdown) on all UDF filters, in order to preserve the semantics.
    - Case 1: If the selectivity of the UDF filter is high, this will cause a 
performance degradation.
    - Case 2: If the selectivity of the UDF filter is low and the UDF is 
expensive, e.g. a DB lookup or a BLAS operation as in `ALS.scala`, this will 
improve performance: as the sketch below shows, without this change the UDF is 
called up to once per occurrence in the filter, plus once more in the 
projection. For example, with three occurrences:
    
    ```scala
    val filteredOnNewColumnDF = newDF.filter("new <> 'a1' and new <> 'a2' and new <> 'a3'")
    ```
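
    To make the discussion concrete, here is a minimal, self-contained sketch 
of the pattern above. Everything besides `filteredOnNewColumnDF` (the 
`SparkSession` setup, the `slow` UDF, the `calls` accumulator, and the sample 
data) is a hypothetical illustration, not code from this PR:

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.udf

    val spark = SparkSession.builder().master("local[*]").appName("udf-pushdown").getOrCreate()
    import spark.implicits._

    // Count UDF invocations with an accumulator, which is safe across tasks.
    val calls = spark.sparkContext.longAccumulator("udfCalls")
    val slow = udf { (s: String) => calls.add(1); s.toLowerCase }

    val newDF = Seq("A1", "A2", "A3", "B1").toDF("value").select(slow($"value").as("new"))
    val filteredOnNewColumnDF = newDF.filter("new <> 'a1' and new <> 'a2' and new <> 'a3'")

    filteredOnNewColumnDF.collect()
    // If the filter is pushed below the Project, the UDF runs up to once per
    // occurrence in the predicate plus once in the Project, for each input row.
    println(s"UDF invocations: ${calls.value}")
    ```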
    
    Although it depends on the workload, in most cases UDFs are assumed to 
have simple logic, and the current `PushDownPredicate` rule is based on that 
assumption.
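
    To verify what the optimizer actually does in a given build, one can print 
the plans for the hypothetical DataFrame above and check where the `Filter` 
lands relative to the `Project`:

    ```scala
    // Prints the parsed, analyzed, optimized, and physical plans. If the
    // predicate was pushed down, the optimized plan shows the Filter below
    // the Project, with the UDF expression repeated per predicate occurrence.
    filteredOnNewColumnDF.explain(true)
    ```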
    
    If Spark wants to keep the current behavior to avoid any performance 
change, that's also okay with me. :) 
    We can just declare explicitly that *ScalaUDF should be deterministic and 
may be invoked more times than the user expects*.
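
    As a hypothetical illustration of why that declaration matters: a UDF that 
is *not* deterministic can give confusing results once the optimizer 
re-evaluates it.

    ```scala
    // Illustrative only: a UDF that violates the determinism assumption.
    val unstable = udf(() => scala.util.Random.nextInt(10))
    val d = spark.range(5).select(unstable().as("r"))
    // If the optimizer substitutes `unstable()` into a pushed-down predicate,
    // the value compared in the filter and the value returned in `r` may come
    // from different invocations and therefore disagree.
    d.filter("r > 5").show()
    ```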

