Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22944#discussion_r231038560
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala ---
    @@ -262,25 +262,39 @@ object AppendColumns {
       def apply[T : Encoder, U : Encoder](
           func: T => U,
           child: LogicalPlan): AppendColumns = {
    +    val outputEncoder = encoderFor[U]
    +    val namedExpressions = if (!outputEncoder.isSerializedAsStruct) {
    +      assert(outputEncoder.namedExpressions.length == 1)
    +      outputEncoder.namedExpressions.map(Alias(_, "key")())
    +    } else {
    +      outputEncoder.namedExpressions
    +    }
         new AppendColumns(
           func.asInstanceOf[Any => Any],
           implicitly[Encoder[T]].clsTag.runtimeClass,
           implicitly[Encoder[T]].schema,
           UnresolvedDeserializer(encoderFor[T].deserializer),
    -      encoderFor[U].namedExpressions,
    +      namedExpressions,
           child)
       }
     
       def apply[T : Encoder, U : Encoder](
           func: T => U,
           inputAttributes: Seq[Attribute],
           child: LogicalPlan): AppendColumns = {
    +    val outputEncoder = encoderFor[U]
    +    val namedExpressions = if (!outputEncoder.isSerializedAsStruct) {
    +      assert(outputEncoder.namedExpressions.length == 1)
    +      outputEncoder.namedExpressions.map(Alias(_, "key")())
    +    } else {
    +      outputEncoder.namedExpressions
    --- End diff --
    
    I tried to add a check to `CheckAnalysis` to detect such `AppendColumns`. I found that we have many such use cases in `DatasetSuite`, e.g. `map and group by with class data`:
    
    ```scala
    val ds: Dataset[(ClassData, Long)] = Seq(ClassData("one", 1), ClassData("two", 2)).toDS()
      .map(c => ClassData(c.a, c.b + 1))
      .groupByKey(p => p).count()
    ```
    
    If users don't access the original output, as this patch shows, it doesn't cause a problem. So disallowing it entirely would be a behavior change. Should we only fail `groupByKey` queries that access an ambiguous field name? Or should we disallow the query outright whenever there is any conflicting name?
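    
    For reference, a `CheckAnalysis`-style rule would essentially look for name collisions between the child plan's output and the appended key columns. Below is a minimal sketch of that check, using plain strings as hypothetical stand-ins for the plan's `Attribute` names (this is not the actual `CheckAnalysis` code):
    
    ```scala
    // Sketch: find column names that the appended grouping-key columns would
    // shadow in the child plan's output. Plain strings stand in for the real
    // Attribute names in the logical plan.
    object KeyConflictCheck {
      def conflictingNames(childOutput: Seq[String], appendedKeys: Seq[String]): Set[String] =
        childOutput.toSet.intersect(appendedKeys.toSet)

      def main(args: Array[String]): Unit = {
        // Grouping by the full ClassData object appends key struct fields
        // "a" and "b", which clash with the child's own "a" and "b" columns.
        val conflicts = conflictingNames(Seq("a", "b"), Seq("a", "b"))
        println(conflicts.mkString(","))
      }
    }
    ```
    
    Failing only when such a conflicting name is actually referenced would preserve the existing `DatasetSuite` cases, at the cost of a later and less predictable error.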

