icexelloss commented on a change in pull request #22305:
[SPARK-24561][SQL][Python] User-defined window aggregation functions with
Pandas UDF (bounded window)
URL: https://github.com/apache/spark/pull/22305#discussion_r240686604
##########
File path:
sql/core/src/main/scala/org/apache/spark/sql/execution/python/WindowInPandasExec.scala
##########
@@ -27,17 +27,62 @@ import org.apache.spark.api.python.{ChainedPythonFunctions, PythonEvalType}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.catalyst.plans.physical._
-import org.apache.spark.sql.execution.{GroupedIterator, SparkPlan, UnaryExecNode}
+import org.apache.spark.sql.catalyst.plans.physical.{AllTuples, ClusteredDistribution, Distribution, Partitioning}
+import org.apache.spark.sql.execution.{ExternalAppendOnlyUnsafeRowArray, SparkPlan}
import org.apache.spark.sql.execution.arrow.ArrowUtils
-import org.apache.spark.sql.types.{DataType, StructField, StructType}
+import org.apache.spark.sql.execution.window._
+import org.apache.spark.sql.types._
import org.apache.spark.util.Utils
+/**
+ * This class calculates and outputs windowed aggregates over the rows in a single partition.
+ *
+ * This is similar to [[WindowExec]]. The main difference is that this node does not compute
+ * any window aggregation values. Instead, it computes the lower and upper bound for each window
+ * (i.e. window bounds) and passes the data and indices to the Python worker to do the actual
+ * window aggregation.
+ *
+ * It currently materializes all data associated with the same partition key and passes it to
+ * the Python worker. This is not strictly necessary for sliding windows and can be improved (by
+ * possibly slicing data into overlapping chunks and stitching them together).
Review comment:
Fixed
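
For readers following the thread, the split of work described in the Scaladoc above (one side computes per-row window bounds, the Python side aggregates over the indexed slices) can be sketched roughly as below. This is only an illustration in plain Python, not the code in this PR: `window_bounds` and `aggregate_windows` are hypothetical names, the frame is assumed to be a row-based sliding frame, and bounds are assumed to be half-open `[begin, end)` index pairs.

```python
from statistics import mean
from typing import Callable, List, Sequence, Tuple


def window_bounds(n_rows: int, lower_offset: int, upper_offset: int) -> List[Tuple[int, int]]:
    """Hypothetical helper: half-open [begin, end) index range for each row
    of one partition, for a frame like ROWS BETWEEN lower AND upper."""
    bounds = []
    for i in range(n_rows):
        # Clamp both ends to the partition so slices never run out of range.
        begin = min(n_rows, max(0, i + lower_offset))
        end = max(begin, min(n_rows, i + upper_offset + 1))
        bounds.append((begin, end))
    return bounds


def aggregate_windows(
    values: Sequence[float],
    bounds: Sequence[Tuple[int, int]],
    agg: Callable[[Sequence[float]], float],
) -> List[float]:
    """Hypothetical stand-in for the worker side: apply the user's
    aggregation to the slice selected by each (begin, end) pair."""
    return [agg(values[begin:end]) for begin, end in bounds]


# One partition of four rows, frame = one row before through one row after.
values = [1.0, 2.0, 3.0, 4.0]
bounds = window_bounds(len(values), -1, 1)
result = aggregate_windows(values, bounds, mean)
```

Note that the whole partition (`values`) is handed over in one piece here, which mirrors the "materializes all data" caveat in the Scaladoc. An aggregation such as `mean` would raise on an empty slice, so frames that can be empty would need separate handling.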