[GitHub] spark pull request: [SPARK-8992] [SQL] Add pivot to dataframe api

aray Fri, 23 Oct 2015 09:09:42 -0700

Github user aray commented on the pull request:

    https://github.com/apache/spark/pull/7841#issuecomment-150620321
  
    @rxin here is my summary of other frameworks API's
    
    I'm going to use an example dataset form the pandas doc for all the 
examples (as df)
    
    |A|B|C|D|
    |---|---|---|---|
    |foo|one|small|1|
    |foo|one|large|2|
    |foo|one|large|2|
    |foo|two|small|3|
    |foo|two|small|3|
    |bar|one|large|4|
    |bar|one|small|5|
    |bar|two|small|6|
    |bar|two|large|7|
    
    This API
    --------
    
    ```scala
    scala> df.groupBy("A", "B").pivot("C", "small", "large").sum("D").show
    +---+---+-----+-----+
    |  A|  B|small|large|
    +---+---+-----+-----+
    |foo|two|    6| null|
    |bar|two|    6|    7|
    |foo|one|    1|    4|
    |bar|one|    5|    4|
    +---+---+-----+-----+
    
    scala> df.groupBy("A", "B").pivot("C", "small", "large").agg(sum("D"), 
avg("D")).show
    +---+---+------------+------------+------------+------------+
    |  A|  B|small sum(D)|small avg(D)|large sum(D)|large avg(D)|
    +---+---+------------+------------+------------+------------+
    |foo|two|           6|         3.0|        null|        null|
    |bar|two|           6|         6.0|           7|         7.0|
    |foo|one|           1|         1.0|           4|         2.0|
    |bar|one|           5|         5.0|           4|         4.0|
    +---+---+------------+------------+------------+------------+
    
    scala> df.pivot(Seq($"A", $"B"), $"C", Seq("small", "large"), 
sum($"D")).show
    +---+---+-----+-----+
    |  A|  B|small|large|
    +---+---+-----+-----+
    |foo|two|    6| null|
    |bar|two|    6|    7|
    |foo|one|    1|    4|
    |bar|one|    5|    4|
    +---+---+-----+-----+
    ```
    
    We require a list of values for the pivot column as we are required to know 
the output columns of the operator ahead of time. Pandas and reshape2 do not 
require this but the comparable SQL operators do. We also allow multiple 
aggregations which not all implementations allow.
    
    pandas
    ------
    
    The comparable metod for pandas is `pivot_table(data, values=None, 
index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, 
dropna=True)`
    
    Example
    
    ```python
    >>> pivot_table(df, values='D', index=['A', 'B'], columns=['C'], 
aggfunc=np.sum)
              small  large
    foo  one  1      4
         two  6      NaN
    bar  one  5      4
         two  6      7
    ```
    
    Pandas also allows multiple aggregations:
    
    ```python
    >>> pivot_table(df, values='D', index=['A', 'B'], columns=['C'], 
aggfunc=[np.sum, np.average])
              sum       average      
    C       large small   large small
    A   B                            
    bar one     4     5       4     5
        two     7     6       7     6
    foo one     4     1       2     1
        two   NaN     6     NaN     3
    ```
    
    References
    
    - http://pandas.pydata.org/pandas-docs/stable/reshaping.html
    - 
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html
    
    See also: `pivot`, `stack`, `unstack`.
    
    reshape2 (R)
    ------------
    The comparable method for reshape2 is `dcast(data, formula, fun.aggregate = 
NULL, ..., margins = NULL, subset = NULL, fill = NULL, drop = TRUE, value.var = 
guess_value(data))`
    
    ```r
    > dcast(df, A + B ~ C, sum)
    Using D as value column: use value.var to override.
        A   B large small
    1 bar one     4     5
    2 bar two     7     6
    3 foo one     4     1
    4 foo two     0     6
    ```
    
    Note that by default cast fills with the value from applying fun.aggregate 
to 0 length vector
    
    References
    
    - https://cran.r-project.org/web/packages/reshape2/reshape2.pdf
    - http://seananderson.ca/2013/10/19/reshape.html
    - http://www.inside-r.org/packages/cran/reshape2/docs/cast
    
    See also: `melt`.
    
    MS SQL Server
    ----------
    
    ```sql
    SELECT *
    FROM df
    pivot (sum(D) for C in ([small], [large])) p
    ```
    
    http://sqlfiddle.com/#!3/cf887/3/0
    
    References
    
    - http://sqlhints.com/2014/03/10/pivot-and-unpivot-in-sql-server/
    
    
    Oracle 11g
    ----------
    
    ```sql
    SELECT *
    FROM df
    pivot (sum(D) for C in ('small', 'large')) p
    ```
    http://sqlfiddle.com/#!4/29bc5/3/0
    
    Oracle also allows multiple aggregations and with similar output to this api
    
    ```sql
    SELECT *
    FROM df
    pivot (sum(D) as sum, avg(D) as avg for C in ('small', 'large')) p
    ```
    http://sqlfiddle.com/#!4/29bc5/5/0
    
    References
    
    - http://www.oracle.com/technetwork/articles/sql/11g-pivot-097235.html
    - 
http://docs.oracle.com/cd/B28359_01/server.111/b28286/statements_10002.htm#CHDCEJJE
    - http://www.techonthenet.com/oracle/pivot.php
    
    ----------
    
    Let me know if I can do anything else to help this along. Also would you 
mind adding me to the jenkins whitelist so I can test it?




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request: [SPARK-8992] [SQL] Add pivot to dataframe api

Reply via email to