Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22944#discussion_r231038560

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala ---
    @@ -262,25 +262,39 @@ object AppendColumns {
       def apply[T : Encoder, U : Encoder](
           func: T => U,
           child: LogicalPlan): AppendColumns = {
    +    val outputEncoder = encoderFor[U]
    +    val namedExpressions = if (!outputEncoder.isSerializedAsStruct) {
    +      assert(outputEncoder.namedExpressions.length == 1)
    +      outputEncoder.namedExpressions.map(Alias(_, "key")())
    +    } else {
    +      outputEncoder.namedExpressions
    +    }
         new AppendColumns(
           func.asInstanceOf[Any => Any],
           implicitly[Encoder[T]].clsTag.runtimeClass,
           implicitly[Encoder[T]].schema,
           UnresolvedDeserializer(encoderFor[T].deserializer),
    -      encoderFor[U].namedExpressions,
    +      namedExpressions,
           child)
       }

       def apply[T : Encoder, U : Encoder](
           func: T => U,
           inputAttributes: Seq[Attribute],
           child: LogicalPlan): AppendColumns = {
    +    val outputEncoder = encoderFor[U]
    +    val namedExpressions = if (!outputEncoder.isSerializedAsStruct) {
    +      assert(outputEncoder.namedExpressions.length == 1)
    +      outputEncoder.namedExpressions.map(Alias(_, "key")())
    +    } else {
    +      outputEncoder.namedExpressions
    --- End diff --

I tried to add a check into `CheckAnalysis` to detect such `AppendColumns` plans, and found that we have many such use cases in `DatasetSuite`, e.g. `map and group by with class data`:

```scala
val ds: Dataset[(ClassData, Long)] = Seq(ClassData("one", 1), ClassData("two", 2)).toDS()
  .map(c => ClassData(c.a, c.b + 1))
  .groupByKey(p => p).count()
```

As long as users don't access the original output, as this patch shows, it doesn't cause a problem. So disallowing it entirely would be a behavior change. Should we fail only the `groupByKey` queries that access ambiguous field names? Or should we disallow it entirely whenever there is any conflicting name?
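For readers following along, the branch in the diff above can be illustrated with a simplified, Spark-free sketch. The `Expr`, `FlatEnc`, and `StructEnc` types below are hypothetical stand-ins for Catalyst's `NamedExpression` and `ExpressionEncoder`, not actual Spark APIs; the point is only the shape of the logic: a flat (non-struct) key encoder produces exactly one column, which gets aliased to `"key"`, while a struct encoder keeps its field names as-is.

```scala
// Hypothetical stand-ins for Catalyst types, for illustration only.
case class Expr(name: String)

sealed trait Enc {
  def isSerializedAsStruct: Boolean
  def namedExpressions: Seq[Expr]
}

// A "flat" encoder (e.g. for Long or String): a single column.
case class FlatEnc(col: Expr) extends Enc {
  val isSerializedAsStruct = false
  def namedExpressions: Seq[Expr] = Seq(col)
}

// A struct encoder (e.g. for a case class): one column per field.
case class StructEnc(fields: Seq[Expr]) extends Enc {
  val isSerializedAsStruct = true
  def namedExpressions: Seq[Expr] = fields
}

// Mirrors the patched branch in AppendColumns.apply: alias the single
// flat column as "key" so it cannot clash with the child's output names;
// struct fields keep their own (potentially conflicting) names.
def keyExpressions(enc: Enc): Seq[Expr] =
  if (!enc.isSerializedAsStruct) {
    assert(enc.namedExpressions.length == 1)
    enc.namedExpressions.map(_ => Expr("key"))
  } else {
    enc.namedExpressions
  }
```

Note that `keyExpressions(StructEnc(...))` passes the field names through unchanged, which is exactly the ambiguous case discussed here: in the `DatasetSuite` example, the key's `a`/`b` fields share names with the child plan's columns.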