Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/19787#discussion_r152195436
--- Diff: python/pyspark/sql/functions.py ---
@@ -2198,12 +2198,9 @@ def udf(f=None, returnType=StringType()):
duplicate invocations may be eliminated or the function may even be invoked more times than
it is present in the query.
- .. note:: The user-defined functions do not support conditional execution by using them with
- SQL conditional expressions such as `when` or `if`. The functions still apply on all rows no
- matter the conditions are met or not. So the output is correct if the functions can be
- correctly run on all rows without failure. If the functions can cause runtime failure on the
- rows that do not satisfy the conditions, the suggested workaround is to incorporate the
- condition logic into the functions.
+ .. note:: The user-defined functions do not support conditional expressions or short curcuiting
+ in boolean expressions and it ends up with being executed all internally. If the functions
+ can fail on special rows, the workaround is to incorporate the condition into the functions.
--- End diff --
Hm, actually, doesn't the same thing apply to `pandas_udf` too? I was just
double-checking:
```python
from pyspark.sql.functions import pandas_udf

# Assumes an active SparkSession named `spark`, as in the pyspark shell.

def call1(b):
    print("I am call1")
    return b

def call2(b):
    print("I am call2")
    return b

bool1 = pandas_udf(call1, "boolean")
bool2 = pandas_udf(call2, "boolean")

df = spark.createDataFrame([[True]])
# If `|` short-circuited, call2 would not need to run: bool1 is already True.
df.select(bool1("_1") | bool2("_1")).explain(True)
df.select(bool1("_1") | bool2("_1")).show()
```
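
As an aside, the workaround the new note suggests (incorporating the condition into the function) could look roughly like the sketch below. `safe_reciprocal` and the sample rows are hypothetical and not from the PR. The list comprehension only stands in for Spark invoking the function on every row, whatever conditional expression surrounds it.

```python
def safe_reciprocal(x):
    """Hypothetical UDF body with the guard folded into the function itself."""
    # Rows that would otherwise raise (None or zero) are handled here,
    # because a surrounding `when`/`if` would not stop the call from happening.
    if x is None or x == 0:
        return None
    return 1.0 / x


# Stand-in for Spark applying the function to all rows, matching or not.
rows = [2.0, 0.0, None]
results = [safe_reciprocal(r) for r in rows]
print(results)

# With PySpark available, the same function could presumably be wrapped as:
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import DoubleType
#   safe_udf = udf(safe_reciprocal, DoubleType())
```

The point of the design is that the function stays safe even when it is evaluated on rows the query condition was meant to exclude.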