Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/20211#discussion_r160829658
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---
@@ -457,13 +458,26 @@ class RelationalGroupedDataset protected[sql](
val groupingNamedExpressions = groupingExprs.map {
case ne: NamedExpression => ne
- case other => Alias(other, other.toString)()
+ case other => Alias(other, toPrettySQL(other))()
}
val groupingAttributes = groupingNamedExpressions.map(_.toAttribute)
val child = df.logicalPlan
val project = Project(groupingNamedExpressions ++ child.output, child)
- val output = expr.dataType.asInstanceOf[StructType].toAttributes
- val plan = FlatMapGroupsInPandas(groupingAttributes, expr, output,
project)
+ val udfOutput: Seq[Attribute] =
expr.dataType.asInstanceOf[StructType].toAttributes
+ val additionalGroupingAttributes = mutable.ArrayBuffer[Attribute]()
+
+ for (attribute <- groupingAttributes) {
+ if (!udfOutput.map(_.name).contains(attribute.name)) {
--- End diff --
@ueshin You brought up a very good point about an issue I struggle a bit
with - conflicting column names in grouping column and UDF output.
When this happens, we have a few choices:
1. Keep both columns and rename one of them
The benefit of this approach is that it gives the user the most
information, but might result in arbitrary column names such like `v_`. Also
another downside is if the UDF just adds or replace columns, this will result
duplicate columns.
2. Keep both columns and don't rename
This is consistent with groupby agg behavior, so probably better than
(1), but still, will result in duplicate columns.
3. Drop conflict group columns
This is the approach implemented in this PR. The reason I choose this
is because I think it's a rare case that the user want to change the grouping
column and at the same time, want the original grouping column. Therefore, I
think it makes most sense to make the user do a bit extra work - explicitly
create a another column rather than overriding the grouping column.
4. Drop conflict UDF columns
I don't think drop UDF output is reasonable behavior.
@ueshin which one do you prefer?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]