[GitHub] [spark] HyukjinKwon commented on a change in pull request #28745: [SPARK-31915][SQL][PYTHON] Remove projection that adds grouping keys in grouped and cogrouped pandas UDFs

GitBox Sun, 07 Jun 2020 18:42:19 -0700


HyukjinKwon commented on a change in pull request #28745:
URL: https://github.com/apache/spark/pull/28745#discussion_r436424483




##########
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/python/PandasGroupUtils.scala
##########
@@ -59,65 +59,65 @@ private[python] object PandasGroupUtils {
    */
   def groupAndProject(
       input: Iterator[InternalRow],
-      groupingAttributes: Seq[Attribute],
+      groupingExprs: Seq[NamedExpression],
       inputSchema: Seq[Attribute],
-      dedupSchema: Seq[Attribute]): Iterator[(InternalRow, 
Iterator[InternalRow])] = {
-    val groupedIter = GroupedIterator(input, groupingAttributes, inputSchema)
+      dedupSchema: Seq[NamedExpression]): Iterator[(InternalRow, 
Iterator[InternalRow])] = {
+    val groupedIter = GroupedIterator(input, groupingExprs, inputSchema)
     val dedupProj = UnsafeProjection.create(dedupSchema, inputSchema)
     groupedIter.map {
       case (k, groupedRowIter) => (k, groupedRowIter.map(dedupProj))
     }
   }
 
   /**
-   * Returns a the deduplicated attributes of the spark plan and the arg 
offsets of the
+   * Returns a the deduplicated named expressions of the spark plan and the 
arg offsets of the
    * keys and values.
    *
-   * The deduplicated attributes are needed because the spark plan may contain 
an attribute
-   * twice; once in the key and once in the value.  For any such attribute we 
need to
+   * The deduplicated expressions are needed because the spark plan may 
contain an expression
+   * twice; once in the key and once in the value.  For any such expression we 
need to
    * deduplicate.
    *
-   * The arg offsets are used to distinguish grouping grouping attributes and 
data attributes
+   * The arg offsets are used to distinguish grouping expressions and data 
expressions
    * as following:
    *
    * argOffsets[0] is the length of the argOffsets array
    *
-   * argOffsets[1] is the length of grouping attribute
-   * argOffsets[2 .. argOffsets[0]+2] is the arg offsets for grouping 
attributes
+   * argOffsets[1] is the length of grouping expression
+   * argOffsets[2 .. argOffsets[0]+2] is the arg offsets for grouping 
expressions
    *
-   * argOffsets[argOffsets[0]+2 .. ] is the arg offsets for data attributes
+   * argOffsets[argOffsets[0]+2 .. ] is the arg offsets for data expressions
    */
   def resolveArgOffsets(
-    child: SparkPlan, groupingAttributes: Seq[Attribute]): (Seq[Attribute], 
Array[Int]) = {
+      dataExprs: Seq[NamedExpression], groupingExprs: Seq[NamedExpression])
+    : (Seq[NamedExpression], Array[Int]) = {
 
-    val dataAttributes = child.output.drop(groupingAttributes.length)
-    val groupingIndicesInData = groupingAttributes.map { attribute =>
-      dataAttributes.indexWhere(attribute.semanticEquals)
+    val groupingIndicesInData = groupingExprs.map { expression =>
+      dataExprs.indexWhere(expression.semanticEquals)
     }

Review comment:
       Actually the `groupingExprs` will be projected at 
https://github.com/apache/spark/pull/28745/files/2800eb238465547074498eae762199a53efc4277#diff-e7c34a6080e15837af82863db34fb1c4R66
 for the input iterator before the actual execution.
   
   The `groupingExprs` were already dropped in this code without this fix 
https://github.com/apache/spark/pull/28745/files/2800eb238465547074498eae762199a53efc4277#diff-e7c34a6080e15837af82863db34fb1c4L93
   
   I believe there's no difference virtually in the execution path here.
   
   For analysis,
   
   With this change: `groupingExprs` at `FlatMapGroupsInPandasExec`, for 
example, `column + 1`.  The attributes `column` inside `column + 1` will be 
properly resolved, and then it becomes an alias to project later during 
execution.
   
   Without this change: `Project`'s output contains the grouping expression as 
a separate attribute reference, `column + 1` (whereas the current fix keeps it 
as an expression). `FlatMapGroupsInPandasExec` contains the attribute reference 
as a grouping expression , and this grouping attribute will be used to project 
later.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] HyukjinKwon commented on a change in pull request #28745: [SPARK-31915][SQL][PYTHON] Remove projection that adds grouping keys in grouped and cogrouped pandas UDFs

Reply via email to