Repository: spark
Updated Branches:
refs/heads/branch-2.0 ebbbf2136 -> 1371d5ece
[SPARK-15696][SQL] Improve `crosstab` to have a consistent column order
## What changes were proposed in this pull request?
Currently, `crosstab` returns a Dataframe having **random-order** columns
obtained by just `distinct`. Also, the documentation of `crosstab` shows the
result in a sorted order which is different from the current implementation.
This PR explicitly constructs the columns in a sorted order in order to improve
user experience. Also, this implementation gives the same result with the
documentation.
**Before**
```scala
scala> spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3,
2), (3, 3))).toDF("key", "value").stat.crosstab("key", "value").show()
+---------+---+---+---+
|key_value| 3| 2| 1|
+---------+---+---+---+
| 2| 1| 0| 2|
| 1| 0| 1| 1|
| 3| 1| 1| 0|
+---------+---+---+---+
scala> spark.createDataFrame(Seq((1, "a"), (1, "b"), (2, "a"), (2, "a"), (2,
"c"), (3, "b"), (3, "c"))).toDF("key", "value").stat.crosstab("key",
"value").show()
+---------+---+---+---+
|key_value| c| a| b|
+---------+---+---+---+
| 2| 1| 2| 0|
| 1| 0| 1| 1|
| 3| 1| 0| 1|
+---------+---+---+---+
```
**After**
```scala
scala> spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3,
2), (3, 3))).toDF("key", "value").stat.crosstab("key", "value").show()
+---------+---+---+---+
|key_value| 1| 2| 3|
+---------+---+---+---+
| 2| 2| 0| 1|
| 1| 1| 1| 0|
| 3| 0| 1| 1|
+---------+---+---+---+
scala> spark.createDataFrame(Seq((1, "a"), (1, "b"), (2, "a"), (2, "a"), (2,
"c"), (3, "b"), (3, "c"))).toDF("key", "value").stat.crosstab("key",
"value").show()
+---------+---+---+---+
|key_value| a| b| c|
+---------+---+---+---+
| 2| 2| 0| 1|
| 1| 1| 1| 0|
| 3| 0| 1| 1|
+---------+---+---+---+
```
## How was this patch tested?
Pass the Jenkins tests with updated testcases.
Author: Dongjoon Hyun <[email protected]>
Closes #13436 from dongjoon-hyun/SPARK-15696.
(cherry picked from commit 5a3533e779d8e43ce0980203dfd3cbe343cc7d0a)
Signed-off-by: Reynold Xin <[email protected]>
Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1371d5ec
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1371d5ec
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1371d5ec
Branch: refs/heads/branch-2.0
Commit: 1371d5ecedb76d138dc9f431e5b40e36a58ed9ca
Parents: ebbbf21
Author: Dongjoon Hyun <[email protected]>
Authored: Thu Jun 9 22:46:51 2016 -0700
Committer: Reynold Xin <[email protected]>
Committed: Thu Jun 9 22:46:58 2016 -0700
----------------------------------------------------------------------
.../org/apache/spark/sql/execution/stat/StatFunctions.scala | 4 ++--
.../test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java | 4 ++--
2 files changed, 4 insertions(+), 4 deletions(-)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/spark/blob/1371d5ec/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala
----------------------------------------------------------------------
diff --git
a/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala
b/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala
index 9c04061..ea58df7 100644
---
a/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala
+++
b/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala
@@ -423,9 +423,9 @@ private[sql] object StatFunctions extends Logging {
def cleanElement(element: Any): String = {
if (element == null) "null" else element.toString
}
- // get the distinct values of column 2, so that we can make them the
column names
+ // get the distinct sorted values of column 2, so that we can make them
the column names
val distinctCol2: Map[Any, Int] =
- counts.map(e => cleanElement(e.get(1))).distinct.zipWithIndex.toMap
+ counts.map(e =>
cleanElement(e.get(1))).distinct.sorted.zipWithIndex.toMap
val columnSize = distinctCol2.size
require(columnSize < 1e4, s"The number of distinct values for $col2, can't
" +
s"exceed 1e4. Currently $columnSize")
http://git-wip-us.apache.org/repos/asf/spark/blob/1371d5ec/sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java
----------------------------------------------------------------------
diff --git
a/sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java
b/sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java
index 1e8f106..0152f3f 100644
--- a/sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java
+++ b/sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java
@@ -246,8 +246,8 @@ public class JavaDataFrameSuite {
Dataset<Row> crosstab = df.stat().crosstab("a", "b");
String[] columnNames = crosstab.schema().fieldNames();
Assert.assertEquals("a_b", columnNames[0]);
- Assert.assertEquals("2", columnNames[1]);
- Assert.assertEquals("1", columnNames[2]);
+ Assert.assertEquals("1", columnNames[1]);
+ Assert.assertEquals("2", columnNames[2]);
List<Row> rows = crosstab.collectAsList();
Collections.sort(rows, crosstabRowComparator);
Integer count = 1;
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]