[GitHub] spark pull request #20211: [SPARK-23011][PYTHON][SQL] Prepend missing groupi...

icexelloss Wed, 10 Jan 2018 15:38:50 -0800

Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20211#discussion_r160829658
  
    --- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---
    @@ -457,13 +458,26 @@ class RelationalGroupedDataset protected[sql](
     
         val groupingNamedExpressions = groupingExprs.map {
           case ne: NamedExpression => ne
    -      case other => Alias(other, other.toString)()
    +      case other => Alias(other, toPrettySQL(other))()
         }
         val groupingAttributes = groupingNamedExpressions.map(_.toAttribute)
         val child = df.logicalPlan
         val project = Project(groupingNamedExpressions ++ child.output, child)
    -    val output = expr.dataType.asInstanceOf[StructType].toAttributes
    -    val plan = FlatMapGroupsInPandas(groupingAttributes, expr, output, 
project)
    +    val udfOutput: Seq[Attribute] = 
expr.dataType.asInstanceOf[StructType].toAttributes
    +    val additionalGroupingAttributes = mutable.ArrayBuffer[Attribute]()
    +
    +    for (attribute <- groupingAttributes) {
    +      if (!udfOutput.map(_.name).contains(attribute.name)) {
    --- End diff --
    
    @ueshin You brought up a very good point about an issue I struggle a bit 
with - conflicting column names in grouping column and UDF output.
    
    When this happens, we have a few choices:
    1. Keep both columns and rename one of them
       The benefit of this approach is that it gives the user the most 
information, but might result in arbitrary column names such like `v_`. Also 
another downside is if the UDF just adds or replace columns, this will result 
duplicate columns.
    2. Keep both columns and don't rename
        This is consistent with groupby agg behavior, so probably better than 
(1), but still, will result in duplicate columns.
    3. Drop conflict group columns
        This is the approach implemented in this PR. The reason I choose this 
is because I think it's a rare case that the user want to change the grouping 
column and at the same time, want the original grouping column. Therefore, I 
think it makes most sense to make the user do a bit extra work - explicitly 
create a another column rather than overriding the grouping column.
    4. Drop conflict UDF columns
        I don't think drop UDF output is reasonable behavior.
    
    @ueshin which one do you prefer?



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20211: [SPARK-23011][PYTHON][SQL] Prepend missing groupi...

Reply via email to