Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/21650#discussion_r199022212
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala
---
@@ -94,36 +95,59 @@ object ExtractPythonUDFFromAggregate extends Rule[LogicalPlan] {
*/
object ExtractPythonUDFs extends Rule[SparkPlan] with PredicateHelper {
- private def hasPythonUDF(e: Expression): Boolean = {
+ private def hasScalarPythonUDF(e: Expression): Boolean = {
e.find(PythonUDF.isScalarPythonUDF).isDefined
}
- private def canEvaluateInPython(e: PythonUDF): Boolean = {
- e.children match {
- // single PythonUDF child could be chained and evaluated in Python
- case Seq(u: PythonUDF) => canEvaluateInPython(u)
- // Python UDF can't be evaluated directly in JVM
- case children => !children.exists(hasPythonUDF)
+ private def canEvaluateInPython(e: PythonUDF, evalType: Int): Boolean = {
+ if (e.evalType != evalType) {
+ false
+ } else {
+ e.children match {
+ // single PythonUDF child could be chained and evaluated in Python
+ case Seq(u: PythonUDF) => canEvaluateInPython(u, evalType)
+ // Python UDF can't be evaluated directly in JVM
+ case children => !children.exists(hasScalarPythonUDF)
+ }
}
}
- private def collectEvaluatableUDF(expr: Expression): Seq[PythonUDF] = expr match {
- case udf: PythonUDF if PythonUDF.isScalarPythonUDF(udf) && canEvaluateInPython(udf) => Seq(udf)
- case e => e.children.flatMap(collectEvaluatableUDF)
+ private def collectEvaluableUDF(expr: Expression, evalType: Int): Seq[PythonUDF] = expr match {
+ case udf: PythonUDF if PythonUDF.isScalarPythonUDF(udf) && canEvaluateInPython(udf, evalType) =>
+ Seq(udf)
+ case e => e.children.flatMap(collectEvaluableUDF(_, evalType))
+ }
+
+ /**
+ * Collect evaluable UDFs from the current node.
+ *
+ * This function collects Python UDFs or Scalar Python UDFs from expressions of the input node,
+ * and returns a list of UDFs of the same eval type.
+ *
+ * If expressions contain both UDFs eval types, this function will only return Python UDFs.
+ *
+ * The caller should call this function multiple times until all evaluable UDFs are collected.
--- End diff --
So this will pipeline UDFs of the same eval type so that they can be
processed together in a single call to the Python worker?
For example, if we have `pandas_udf, pandas_udf, udf, udf`, then both
`pandas_udf`s will be sent to the worker together, followed by both `udf`s,
so the Python runner is executed twice.
On the other hand, if we have `pandas_udf, udf, pandas_udf, udf`, then each
one will have to be executed separately, and the Python runner is executed 4
times. Is that right?
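
The reading described in the question can be written down as a toy model. This is only a sketch of that reading, not Spark's implementation: `count_runner_calls` and the string labels are hypothetical, and it assumes that only adjacent UDFs of the same eval type can share one Python runner invocation.

```python
from itertools import groupby


def count_runner_calls(eval_types):
    """Count runner invocations under the assumption in the question:
    each run of adjacent UDFs with the same eval type is sent to the
    worker as one batch (hypothetical model, not Spark's planner)."""
    # groupby yields one group per run of consecutive equal labels.
    return sum(1 for _ in groupby(eval_types))


grouped = ["pandas_udf", "pandas_udf", "udf", "udf"]
interleaved = ["pandas_udf", "udf", "pandas_udf", "udf"]

print(count_runner_calls(grouped))      # 2
print(count_runner_calls(interleaved))  # 4
```

Whether the rule in this diff actually behaves this way for the interleaved case is exactly what the question asks.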
---