Github user ueshin commented on a diff in the pull request:
https://github.com/apache/spark/pull/20211#discussion_r160605967
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---
@@ -457,13 +458,26 @@ class RelationalGroupedDataset protected[sql](
val groupingNamedExpressions = groupingExprs.map {
case ne: NamedExpression => ne
- case other => Alias(other, other.toString)()
+ case other => Alias(other, toPrettySQL(other))()
}
val groupingAttributes = groupingNamedExpressions.map(_.toAttribute)
val child = df.logicalPlan
val project = Project(groupingNamedExpressions ++ child.output, child)
- val output = expr.dataType.asInstanceOf[StructType].toAttributes
- val plan = FlatMapGroupsInPandas(groupingAttributes, expr, output, project)
+ val udfOutput: Seq[Attribute] = expr.dataType.asInstanceOf[StructType].toAttributes
+ val additionalGroupingAttributes = mutable.ArrayBuffer[Attribute]()
+
+ for (attribute <- groupingAttributes) {
+ if (!udfOutput.map(_.name).contains(attribute.name)) {
--- End diff ---
I'm wondering whether we should decide the additional grouping attributes
based only on their names.
For example from tests:
```python
result3 = df.groupby('id', 'v').apply(foo).sort('id', 'v').toPandas()
```
The column `v` in `result3` is not the actual grouping value: it is
overwritten by the value returned from the UDF, because the UDF's output
schema contains a column with the same name. I'm not sure this is the
desired behavior.
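To illustrate the concern, here is a minimal sketch in plain Python (no Spark; the column names are just illustrative) of the name-only check in the diff: a grouping attribute is kept as an additional output column only if the UDF's output schema does not already contain a column with the same name, so a same-named UDF column silently shadows the real grouping value.

```python
# Columns used in groupby(...), as in result3 above.
grouping_attributes = ["id", "v"]

# Column names of the StructType returned by the grouped-map UDF.
# Note the UDF also returns a column named "v".
udf_output = ["id", "v"]

# Name-only check from the diff: keep a grouping attribute only when
# no UDF output column shares its name.
additional = [a for a in grouping_attributes if a not in udf_output]

# "v" is dropped here, so the UDF's "v" (not the grouping value)
# ends up in the result.
print(additional)  # -> []
```

A stricter check might compare attribute identity (expression IDs) rather than names, but that is outside the scope of this snippet.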
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]