Github user aray commented on the pull request:
https://github.com/apache/spark/pull/7841#issuecomment-150620321
@rxin here is my summary of other frameworks API's
I'm going to use an example dataset form the pandas doc for all the
examples (as df)
|A|B|C|D|
|---|---|---|---|
|foo|one|small|1|
|foo|one|large|2|
|foo|one|large|2|
|foo|two|small|3|
|foo|two|small|3|
|bar|one|large|4|
|bar|one|small|5|
|bar|two|small|6|
|bar|two|large|7|
This API
--------
```scala
scala> df.groupBy("A", "B").pivot("C", "small", "large").sum("D").show
+---+---+-----+-----+
| A| B|small|large|
+---+---+-----+-----+
|foo|two| 6| null|
|bar|two| 6| 7|
|foo|one| 1| 4|
|bar|one| 5| 4|
+---+---+-----+-----+
scala> df.groupBy("A", "B").pivot("C", "small", "large").agg(sum("D"),
avg("D")).show
+---+---+------------+------------+------------+------------+
| A| B|small sum(D)|small avg(D)|large sum(D)|large avg(D)|
+---+---+------------+------------+------------+------------+
|foo|two| 6| 3.0| null| null|
|bar|two| 6| 6.0| 7| 7.0|
|foo|one| 1| 1.0| 4| 2.0|
|bar|one| 5| 5.0| 4| 4.0|
+---+---+------------+------------+------------+------------+
scala> df.pivot(Seq($"A", $"B"), $"C", Seq("small", "large"),
sum($"D")).show
+---+---+-----+-----+
| A| B|small|large|
+---+---+-----+-----+
|foo|two| 6| null|
|bar|two| 6| 7|
|foo|one| 1| 4|
|bar|one| 5| 4|
+---+---+-----+-----+
```
We require a list of values for the pivot column as we are required to know
the output columns of the operator ahead of time. Pandas and reshape2 do not
require this but the comparable SQL operators do. We also allow multiple
aggregations which not all implementations allow.
pandas
------
The comparable metod for pandas is `pivot_table(data, values=None,
index=None, columns=None, aggfunc='mean', fill_value=None, margins=False,
dropna=True)`
Example
```python
>>> pivot_table(df, values='D', index=['A', 'B'], columns=['C'],
aggfunc=np.sum)
small large
foo one 1 4
two 6 NaN
bar one 5 4
two 6 7
```
Pandas also allows multiple aggregations:
```python
>>> pivot_table(df, values='D', index=['A', 'B'], columns=['C'],
aggfunc=[np.sum, np.average])
sum average
C large small large small
A B
bar one 4 5 4 5
two 7 6 7 6
foo one 4 1 2 1
two NaN 6 NaN 3
```
References
- http://pandas.pydata.org/pandas-docs/stable/reshaping.html
-
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html
See also: `pivot`, `stack`, `unstack`.
reshape2 (R)
------------
The comparable method for reshape2 is `dcast(data, formula, fun.aggregate =
NULL, ..., margins = NULL, subset = NULL, fill = NULL, drop = TRUE, value.var =
guess_value(data))`
```r
> dcast(df, A + B ~ C, sum)
Using D as value column: use value.var to override.
A B large small
1 bar one 4 5
2 bar two 7 6
3 foo one 4 1
4 foo two 0 6
```
Note that by default cast fills with the value from applying fun.aggregate
to 0 length vector
References
- https://cran.r-project.org/web/packages/reshape2/reshape2.pdf
- http://seananderson.ca/2013/10/19/reshape.html
- http://www.inside-r.org/packages/cran/reshape2/docs/cast
See also: `melt`.
MS SQL Server
----------
```sql
SELECT *
FROM df
pivot (sum(D) for C in ([small], [large])) p
```
http://sqlfiddle.com/#!3/cf887/3/0
References
- http://sqlhints.com/2014/03/10/pivot-and-unpivot-in-sql-server/
Oracle 11g
----------
```sql
SELECT *
FROM df
pivot (sum(D) for C in ('small', 'large')) p
```
http://sqlfiddle.com/#!4/29bc5/3/0
Oracle also allows multiple aggregations and with similar output to this api
```sql
SELECT *
FROM df
pivot (sum(D) as sum, avg(D) as avg for C in ('small', 'large')) p
```
http://sqlfiddle.com/#!4/29bc5/5/0
References
- http://www.oracle.com/technetwork/articles/sql/11g-pivot-097235.html
-
http://docs.oracle.com/cd/B28359_01/server.111/b28286/statements_10002.htm#CHDCEJJE
- http://www.techonthenet.com/oracle/pivot.php
----------
Let me know if I can do anything else to help this along. Also would you
mind adding me to the jenkins whitelist so I can test it?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]