EnricoMi commented on PR #39640:
URL: https://github.com/apache/spark/pull/39640#issuecomment-1398282605

   @cloud-fan I ran into the following issue: `ds.groupByKey` adds the key columns to the plan:
   
   ```
   def groupByKey[K: Encoder](func: T => K): KeyValueGroupedDataset[K, T] = {
     val withGroupingKey = AppendColumns(func, logicalPlan)
     val executed = sparkSession.sessionState.executePlan(withGroupingKey)
   
     new KeyValueGroupedDataset(
       encoderFor[K],
       encoderFor[T],
       executed,
       logicalPlan.output,
       withGroupingKey.newColumns)
   }
   ```
   
   Here, `[key#10, seq#11, value#12]` are the group value columns, whereas
   `[value#17]` is the group key column. The user-specified `$"value"` for
   group sorting cannot be resolved in this situation:
   ```
   'MapGroups [value#17], [key#10, seq#11, value#12], ['seq ASC NULLS FIRST, 'length('key) ASC NULLS FIRST, 'value ASC NULLS FIRST], obj#19: java.lang.String
   +- AppendColumns [value#17]
      +- Project [_1#3 AS key#10, _2#4 AS seq#11, _3#5 AS value#12]
         +- LocalRelation [_1#3, _2#4, _3#5]
   ```
   
   The group sort columns should reference only the original value columns (not
   the `AppendColumns` column); otherwise we get an ambiguous reference:
   
   ```
   val ds = Seq(("a", 1, 10), ("a", 2, 20), ("b", 2, 1), ("b", 1, 2), ("c", 1, 1))
     .toDF("key", "seq", "value")
   // groupByKey with Row => String adds the key column `value` to the plan
   val grouped = ds.groupByKey(v => v.getString(0))
   // $"value" here is expected not to reference the key column
   val aggregated = grouped.flatMapSortedGroups($"seq", expr("length(key)"), $"value") {
     (g, iter) => Iterator(g, iter.mkString(", "))
   }
   ```
   
       [AMBIGUOUS_REFERENCE] Reference `value` is ambiguous, could be: [`value`, `value`].
   
   How can I modify the `dataOrder: Seq[SortOrder]` so that `$"value"` is
   resolved against `AppendColumns.child`, i.e. `[key#10, seq#11, value#12]`?
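
   To make the desired behaviour concrete, here is a toy model of the name lookup in plain Scala. It is only a sketch and not Spark's analyzer code: `Attr` and `resolve` are hypothetical names invented for this illustration. It shows that looking up `value` in the full `AppendColumns` output (child output plus key columns) is ambiguous, while looking it up in the child output alone yields `value#12`:

   ```scala
   // Toy model of attribute resolution; Attr and resolve are hypothetical,
   // they are not part of Spark's Catalyst API.
   final case class Attr(name: String, exprId: Int)

   // Resolve a column name against a list of candidate attributes.
   def resolve(name: String, candidates: Seq[Attr]): Either[String, Attr] =
     candidates.filter(_.name == name) match {
       case Seq(single) => Right(single)
       case Seq()       => Left(s"cannot resolve `$name`")
       case many =>
         val names = many.map(a => s"`${a.name}`").mkString("[", ", ", "]")
         Left(s"Reference `$name` is ambiguous, could be: " + names)
     }

   // Output of AppendColumns.child (the group value columns).
   val dataAttrs = Seq(Attr("key", 10), Attr("seq", 11), Attr("value", 12))
   // Columns appended by AppendColumns (the group key columns).
   val keyAttrs = Seq(Attr("value", 17))

   // Resolving against child output plus key columns finds two matches.
   val againstAll = resolve("value", dataAttrs ++ keyAttrs)
   // Resolving against the child output only finds value#12.
   val againstChild = resolve("value", dataAttrs)
   ```

   In this model `againstAll` is a `Left` carrying the ambiguity message and `againstChild` is `Right(Attr("value", 12))`, which is the resolution I would like the sort order expressions to get.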


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

