viirya commented on a change in pull request #28745:
URL: https://github.com/apache/spark/pull/28745#discussion_r436380321
##########
File path:
sql/core/src/main/scala/org/apache/spark/sql/execution/python/PandasGroupUtils.scala
##########
@@ -59,65 +59,65 @@ private[python] object PandasGroupUtils {
*/
def groupAndProject(
input: Iterator[InternalRow],
- groupingAttributes: Seq[Attribute],
+ groupingExprs: Seq[NamedExpression],
inputSchema: Seq[Attribute],
- dedupSchema: Seq[Attribute]): Iterator[(InternalRow,
Iterator[InternalRow])] = {
- val groupedIter = GroupedIterator(input, groupingAttributes, inputSchema)
+ dedupSchema: Seq[NamedExpression]): Iterator[(InternalRow,
Iterator[InternalRow])] = {
+ val groupedIter = GroupedIterator(input, groupingExprs, inputSchema)
val dedupProj = UnsafeProjection.create(dedupSchema, inputSchema)
groupedIter.map {
case (k, groupedRowIter) => (k, groupedRowIter.map(dedupProj))
}
}
/**
- * Returns a the deduplicated attributes of the spark plan and the arg
offsets of the
+ * Returns a the deduplicated named expressions of the spark plan and the
arg offsets of the
* keys and values.
*
- * The deduplicated attributes are needed because the spark plan may contain
an attribute
- * twice; once in the key and once in the value. For any such attribute we
need to
+ * The deduplicated expressions are needed because the spark plan may
contain an expression
+ * twice; once in the key and once in the value. For any such expression we
need to
* deduplicate.
*
- * The arg offsets are used to distinguish grouping grouping attributes and
data attributes
+ * The arg offsets are used to distinguish grouping expressions and data
expressions
* as following:
*
* argOffsets[0] is the length of the argOffsets array
*
- * argOffsets[1] is the length of grouping attribute
- * argOffsets[2 .. argOffsets[0]+2] is the arg offsets for grouping
attributes
+ * argOffsets[1] is the length of grouping expression
+ * argOffsets[2 .. argOffsets[0]+2] is the arg offsets for grouping
expressions
*
- * argOffsets[argOffsets[0]+2 .. ] is the arg offsets for data attributes
+ * argOffsets[argOffsets[0]+2 .. ] is the arg offsets for data expressions
*/
def resolveArgOffsets(
- child: SparkPlan, groupingAttributes: Seq[Attribute]): (Seq[Attribute],
Array[Int]) = {
+ dataExprs: Seq[NamedExpression], groupingExprs: Seq[NamedExpression])
+ : (Seq[NamedExpression], Array[Int]) = {
- val dataAttributes = child.output.drop(groupingAttributes.length)
- val groupingIndicesInData = groupingAttributes.map { attribute =>
- dataAttributes.indexWhere(attribute.semanticEquals)
+ val groupingIndicesInData = groupingExprs.map { expression =>
+ dataExprs.indexWhere(expression.semanticEquals)
}
Review comment:
I feel this looks not precisely correct at all cases. Seems `dataExprs`
are inputs to Python UDFs. Is it possible that `groupingExprs` are not just
child's outputs but expressions like `column + 1`?
In `RelationalGroupedDataset`, we added one projection previously to put
these grouping expressions with original child's outputs. Now we don't have it.
So can we always find semantically equal expr in `dataExprs` for a grouping
expression? `dataExprs` are input expressions in left/right plan for
`FlatMapCoGroupsInPandasExec`, so I guess we cannot find `column + 1` in it.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]