[GitHub] spark issue #13483: [SPARK-15688][SQL] RelationalGroupedDataset.toDF should ...

dilipbiswal Sat, 04 Jun 2016 07:46:46 -0700

Github user dilipbiswal commented on the issue:

    https://github.com/apache/spark/pull/13483
  
    @viirya This is a design decision. So far, neither approach is perfect. 
    
    In my mind, we have to consider the use cases here. If users want 
duplicate columns, they should give the duplicates distinct names. Do you think 
this makes sense? 
    
    That means, we should not remove the duplicate in the following scenario:
    ```
    df.groupBy("col1").agg($"col1".as("col1_replica"), count("*"))
    
    The expected output schema:
    +----+------------+--------+
    |col1|col1_replica|count(1)|
    +----+------------+--------+
    ```
    If they do not change the column name, I am unable to find any usage 
scenario for duplicating the columns. 
    ```
    df.groupBy("col1").agg($"col1", count("*"))
    df.groupBy("col1").agg(count("*"))
    
    The expected output schema of the above two:
    +----+--------+
    |col1|count(1)|
    +----+--------+
    ```
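    To make the proposed rule concrete, here is a minimal sketch in plain 
Scala (no Spark dependency). `Col`, `DedupSketch`, and `outputSchema` are 
hypothetical names used only for illustration, not Spark APIs, and this is not 
the code in the PR. It assumes a column counts as a duplicate only when the 
aggregate list already contains the grouping column under its original name.
    ```scala
    // Hypothetical model of an aggregate output column:
    // the expression it is built from and the name it is output under.
    final case class Col(expr: String, name: String)

    object DedupSketch {
      // Prepend each grouping column unless the aggregate list already
      // outputs that same column under the same (unaliased) name.
      def outputSchema(grouping: Seq[String], aggs: Seq[Col]): Seq[String] = {
        val retained = grouping.filterNot { g =>
          aggs.exists(a => a.expr == g && a.name == g)
        }
        retained ++ aggs.map(_.name)
      }
    }
    ```
    Under this assumption, an aliased copy such as `col1_replica` is kept 
next to `col1`, while an unaliased `$"col1"` collapses into the grouping 
column.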
    
    What is your opinion?



