spark git commit: [SPARK-11876][SQL] Support printSchema in DataSet API

marmbrus Fri, 20 Nov 2015 11:21:12 -0800

Repository: spark
Updated Branches:
  refs/heads/master e359d5dcf -> bef361c58



[SPARK-11876][SQL] Support printSchema in DataSet API

The DataSet APIs look great! However, I lose track of the schema when doing
multi-level joins. For example:
```
val ds1 = Seq(("a", 1), ("b", 2)).toDS().as("a")
val ds2 = Seq(("a", 1), ("b", 2)).toDS().as("b")
val ds3 = Seq(("a", 1), ("b", 2)).toDS().as("c")

ds1.joinWith(ds2, $"a._2" === $"b._2").as("ab")
  .joinWith(ds3, $"ab._1._2" === $"c._2").printSchema()
```

The printed schema looks like this:
```
root
 |-- _1: struct (nullable = true)
 |    |-- _1: struct (nullable = true)
 |    |    |-- _1: string (nullable = true)
 |    |    |-- _2: integer (nullable = true)
 |    |-- _2: struct (nullable = true)
 |    |    |-- _1: string (nullable = true)
 |    |    |-- _2: integer (nullable = true)
 |-- _2: struct (nullable = true)
 |    |-- _1: string (nullable = true)
 |    |-- _2: integer (nullable = true)
```

Personally, I think we need the printSchema function. Without it, I sometimes
do not know how to reference a column, especially when the data types are
mixed. For example, to write the following select against the multi-level join
above (with the join result bound to newDS), I have to know the schema:
```
newDS.select(expr("_1._2._2 + 1").as[Int]).collect()
```
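
To make the path `_1._2._2` concrete, here is a minimal plain-Scala sketch that needs no Spark. It models one joined row as nested tuples matching the printed schema. The names `NestedPathSketch`, `Row`, `row`, and `ds2IntPlusOne` are hypothetical and exist only for illustration, as is the sample row value.

```scala
// Plain Scala, no Spark required. All names here are hypothetical.
object NestedPathSketch {
  // Same shape as the printed schema: ((ds1 row, ds2 row), ds3 row).
  type Row = (((String, Int), (String, Int)), (String, Int))

  // An assumed sample row in which all three sides matched on the Int column.
  val row: Row = ((("a", 1), ("a", 1)), ("a", 1))

  // _1 selects the "ab" pair, _2 its ds2 side, and the final _2 the Int column.
  def ds2IntPlusOne(r: Row): Int = r._1._2._2 + 1

  def main(args: Array[String]): Unit =
    println(ds2IntPlusOne(row)) // prints 2
}
```

Each `_N` step descends one struct level in the tree, which is why the printed schema is needed to write the expression correctly.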

marmbrus rxin cloud-fan: do you feel the same way?

Author: gatorsmile <[email protected]>

Closes #9855 from gatorsmile/printSchemaDataSet.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bef361c5
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bef361c5
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bef361c5

Branch: refs/heads/master
Commit: bef361c589c0a38740232fd8d0a45841e4fc969a
Parents: e359d5d
Author: gatorsmile <[email protected]>
Authored: Fri Nov 20 11:20:47 2015 -0800
Committer: Michael Armbrust <[email protected]>
Committed: Fri Nov 20 11:20:47 2015 -0800

----------------------------------------------------------------------
 .../src/main/scala/org/apache/spark/sql/DataFrame.scala     | 9 ---------
 .../scala/org/apache/spark/sql/execution/Queryable.scala    | 9 +++++++++
 2 files changed, 9 insertions(+), 9 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/bef361c5/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala
----------------------------------------------------------------------
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala
index 9835812..7abceca 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala
@@ -300,15 +300,6 @@ class DataFrame private[sql](
   def columns: Array[String] = schema.fields.map(_.name)
 
   /**
-   * Prints the schema to the console in a nice tree format.
-   * @group basic
-   * @since 1.3.0
-   */
-  // scalastyle:off println
-  def printSchema(): Unit = println(schema.treeString)
-  // scalastyle:on println
-
-  /**
    * Returns true if the `collect` and `take` methods can be run locally
    * (without any Spark executors).
    * @group basic

http://git-wip-us.apache.org/repos/asf/spark/blob/bef361c5/sql/core/src/main/scala/org/apache/spark/sql/execution/Queryable.scala
----------------------------------------------------------------------
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/Queryable.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/Queryable.scala
index e86a52c..321e2c7 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/Queryable.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/Queryable.scala
@@ -38,6 +38,15 @@ private[sql] trait Queryable {
   }
 
   /**
+   * Prints the schema to the console in a nice tree format.
+   * @group basic
+   * @since 1.3.0
+   */
+  // scalastyle:off println
+  def printSchema(): Unit = println(schema.treeString)
+  // scalastyle:on println
+
+  /**
    * Prints the plans (logical and physical) to the console for debugging purposes.
    * @since 1.3.0
    */
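
The diff above moves `printSchema` out of `DataFrame` and into the `Queryable` trait, presumably so that every class mixing in that trait gets the method. The pattern can be sketched in plain Scala as follows. This is not Spark's actual code: `HasSchema`, `schemaTree`, `FrameLike`, `TypedLike`, and `TraitSketch` are hypothetical names, and the schema is modelled as a plain string.

```scala
// Hypothetical illustration of hoisting a method into a shared trait.
trait HasSchema {
  // Each implementer supplies its own schema text.
  def schemaTree: String

  // Defined once in the trait, inherited by every implementer.
  def printSchema(): Unit = println(schemaTree)
}

// Two unrelated classes that both gain printSchema() by mixing in the trait.
final class FrameLike(val schemaTree: String) extends HasSchema
final class TypedLike(val schemaTree: String) extends HasSchema

object TraitSketch {
  def main(args: Array[String]): Unit = {
    new FrameLike("root\n |-- _1: string (nullable = true)").printSchema()
    new TypedLike("root\n |-- _2: integer (nullable = true)").printSchema()
  }
}
```

Keeping the body in the trait means the method is written once, which matches the 9 insertions and 9 deletions reported in the diffstat.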


