GitHub user dongjoon-hyun commented on the pull request:
https://github.com/apache/spark/pull/13087#issuecomment-220721052
For the documentation, sure, I will. We had better declare from now on that
UDFs should be deterministic, as @thunterdb and you said.
For the runtime, I don't have any benchmark for this situation. Maybe
@linbojin, the reporter of this issue, has some. @linbojin, could you
answer @marmbrus's question?
Generally speaking, this PR will prevent early pruning on all UDF filters
in order to preserve the semantics.
- Case 1: If the selectivity of the UDF filter is high, this will cause a
performance degradation.
- Case 2: If the selectivity of the UDF filter is low and the UDF takes a
long time, e.g. a DB lookup or a BLAS operation like in `ALS.scala`, this will
improve performance. As you can see in the example below, without this PR the
UDF is called once per occurrence in the filter condition, plus one more time
for the projected column.
```scala
val filteredOnNewColumnDF = newDF.filter("new <> 'a1' and new <> 'a2' and new <> 'a3'")
```
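To make the repeated evaluation measurable, here is a minimal sketch that
counts UDF invocations with an accumulator (`spark`, `df`, and its string
column `a` are assumptions for illustration; they do not appear in this
thread):
```scala
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical setup: `spark` is an active SparkSession and `df` has a
// string column `a` from which the UDF derives the filtered column `new`.
val calls = spark.sparkContext.longAccumulator("udfCalls")
val toTag = udf { s: String => calls.add(1); s.take(2) }

val newDF = df.withColumn("new", toTag(col("a")))
newDF.filter("new <> 'a1' and new <> 'a2' and new <> 'a3'").collect()

// With the filter pushed below the projection, the UDF expression is
// substituted into each of the three predicates, so per input row the UDF
// can run (occurrences + 1) = 4 times; compare calls.value to df.count().
println(calls.value)
```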
Although it depends on the workload, in most cases UDFs are assumed to have
simple logic, and the current `PushDownPredicate` is based on that assumption.
If Spark wants to keep the current behavior to avoid any performance
change, that's also okay with me. :)
We can just declare explicitly that *ScalaUDF should be deterministic and
may be called more times than the user declared*.
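To illustrate why that declaration matters, here is a minimal sketch of what
can go wrong when a UDF is not deterministic (again, `df` and its string
column `a` are assumed for illustration):
```scala
import scala.util.Random
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical nondeterministic UDF that violates the determinism contract.
val roll = udf { _: String => Random.nextInt(10) }

// `PushDownPredicate` can rewrite the filter below the projection as
// `roll(a) > 5`, so the UDF is evaluated once for the predicate and again
// for the output column `n`. Because the two evaluations can disagree, a
// row may pass the filter while showing an `n` that is not > 5.
val rolled = df.withColumn("n", roll(col("a")))
rolled.filter("n > 5").show()
```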