Repository: spark
Updated Branches:
  refs/heads/branch-2.0 ebbbf2136 -> 1371d5ece


[SPARK-15696][SQL] Improve `crosstab` to have a consistent column order

## What changes were proposed in this pull request?

Currently, `crosstab` returns a Dataframe having **random-order** columns 
obtained by just `distinct`. Also, the documentation of `crosstab` shows the 
result in a sorted order which is different from the current implementation. 
This PR explicitly constructs the columns in a sorted order in order to improve 
user experience. Also, this implementation gives the same result with the 
documentation.

**Before**
```scala
scala> spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 
2), (3, 3))).toDF("key", "value").stat.crosstab("key", "value").show()
+---------+---+---+---+
|key_value|  3|  2|  1|
+---------+---+---+---+
|        2|  1|  0|  2|
|        1|  0|  1|  1|
|        3|  1|  1|  0|
+---------+---+---+---+

scala> spark.createDataFrame(Seq((1, "a"), (1, "b"), (2, "a"), (2, "a"), (2, 
"c"), (3, "b"), (3, "c"))).toDF("key", "value").stat.crosstab("key", 
"value").show()
+---------+---+---+---+
|key_value|  c|  a|  b|
+---------+---+---+---+
|        2|  1|  2|  0|
|        1|  0|  1|  1|
|        3|  1|  0|  1|
+---------+---+---+---+
```

**After**
```scala
scala> spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 
2), (3, 3))).toDF("key", "value").stat.crosstab("key", "value").show()
+---------+---+---+---+
|key_value|  1|  2|  3|
+---------+---+---+---+
|        2|  2|  0|  1|
|        1|  1|  1|  0|
|        3|  0|  1|  1|
+---------+---+---+---+
scala> spark.createDataFrame(Seq((1, "a"), (1, "b"), (2, "a"), (2, "a"), (2, 
"c"), (3, "b"), (3, "c"))).toDF("key", "value").stat.crosstab("key", 
"value").show()
+---------+---+---+---+
|key_value|  a|  b|  c|
+---------+---+---+---+
|        2|  2|  0|  1|
|        1|  1|  1|  0|
|        3|  0|  1|  1|
+---------+---+---+---+
```

## How was this patch tested?

Pass the Jenkins tests with updated testcases.

Author: Dongjoon Hyun <[email protected]>

Closes #13436 from dongjoon-hyun/SPARK-15696.

(cherry picked from commit 5a3533e779d8e43ce0980203dfd3cbe343cc7d0a)
Signed-off-by: Reynold Xin <[email protected]>


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1371d5ec
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1371d5ec
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1371d5ec

Branch: refs/heads/branch-2.0
Commit: 1371d5ecedb76d138dc9f431e5b40e36a58ed9ca
Parents: ebbbf21
Author: Dongjoon Hyun <[email protected]>
Authored: Thu Jun 9 22:46:51 2016 -0700
Committer: Reynold Xin <[email protected]>
Committed: Thu Jun 9 22:46:58 2016 -0700

----------------------------------------------------------------------
 .../org/apache/spark/sql/execution/stat/StatFunctions.scala      | 4 ++--
 .../test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java  | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/1371d5ec/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala
----------------------------------------------------------------------
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala
index 9c04061..ea58df7 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala
@@ -423,9 +423,9 @@ private[sql] object StatFunctions extends Logging {
     def cleanElement(element: Any): String = {
       if (element == null) "null" else element.toString
     }
-    // get the distinct values of column 2, so that we can make them the 
column names
+    // get the distinct sorted values of column 2, so that we can make them 
the column names
     val distinctCol2: Map[Any, Int] =
-      counts.map(e => cleanElement(e.get(1))).distinct.zipWithIndex.toMap
+      counts.map(e => 
cleanElement(e.get(1))).distinct.sorted.zipWithIndex.toMap
     val columnSize = distinctCol2.size
     require(columnSize < 1e4, s"The number of distinct values for $col2, can't 
" +
       s"exceed 1e4. Currently $columnSize")

http://git-wip-us.apache.org/repos/asf/spark/blob/1371d5ec/sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java
----------------------------------------------------------------------
diff --git 
a/sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java 
b/sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java
index 1e8f106..0152f3f 100644
--- a/sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java
+++ b/sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java
@@ -246,8 +246,8 @@ public class JavaDataFrameSuite {
     Dataset<Row> crosstab = df.stat().crosstab("a", "b");
     String[] columnNames = crosstab.schema().fieldNames();
     Assert.assertEquals("a_b", columnNames[0]);
-    Assert.assertEquals("2", columnNames[1]);
-    Assert.assertEquals("1", columnNames[2]);
+    Assert.assertEquals("1", columnNames[1]);
+    Assert.assertEquals("2", columnNames[2]);
     List<Row> rows = crosstab.collectAsList();
     Collections.sort(rows, crosstabRowComparator);
     Integer count = 1;


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to