spark git commit: [SPARK-6748] [SQL] Makes QueryPlan.schema a lazy val

lian Tue, 07 Apr 2015 16:01:19 -0700

Repository: spark
Updated Branches:
  refs/heads/master fc957dc78 -> 77bcceb9f



[SPARK-6748] [SQL] Makes QueryPlan.schema a lazy val

`DataFrame.collect()` calls `SparkPlan.executeCollect()`, which consists of a 
single line:

```scala
execute().map(ScalaReflection.convertRowToScala(_, schema)).collect()
```

The problem is that `QueryPlan.schema` is a method (`def`), so it is
re-evaluated on every call. Since 1.3.0, `convertRowToScala` returns a
`GenericRowWithSchema`, so every `GenericRowWithSchema` instance holds a
separate copy of the schema object. In addition, YJP profiling of the
following simple micro benchmark (executed in the Spark shell) shows that
constructing the schema object takes up to ~35% of CPU time.

```scala
sc.parallelize(1 to 10000000).
  map(i => (i, s"val_$i")).
  toDF("key", "value").
  saveAsParquetFile("file:///tmp/src.parquet")

// Profiling started from this line
sqlContext.parquetFile("file:///tmp/src.parquet").collect()
```
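The difference between `def` and `lazy val` can be illustrated outside Spark with a minimal sketch. It assumes, as described above, that the schema is requested once per row. `ToySchema`, `PlanWithDef`, `PlanWithLazyVal` and `LazyValDemo` are hypothetical stand-ins invented for this illustration, not Spark classes.

```scala
// Hypothetical stand-in for StructType; not a Spark class.
final class ToySchema(val fieldNames: Seq[String])

object ToySchema {
  // Counts how many schema objects have been constructed so far.
  var constructed: Int = 0

  def fromAttributes(names: Seq[String]): ToySchema = {
    constructed += 1
    new ToySchema(names)
  }
}

// Before the change: every access builds a fresh schema object.
class PlanWithDef(output: Seq[String]) {
  def schema: ToySchema = ToySchema.fromAttributes(output)
}

// After the change: the schema is built on first access and then cached.
class PlanWithLazyVal(output: Seq[String]) {
  lazy val schema: ToySchema = ToySchema.fromAttributes(output)
}

object LazyValDemo {
  // Returns how many schema objects are built while asking for the schema
  // once per simulated row.
  def countBuilds(rows: Int)(schemaOf: () => ToySchema): Int = {
    val before = ToySchema.constructed
    (1 to rows).foreach(_ => schemaOf())
    ToySchema.constructed - before
  }

  def main(args: Array[String]): Unit = {
    val defPlan = new PlanWithDef(Seq("key", "value"))
    val lazyPlan = new PlanWithLazyVal(Seq("key", "value"))
    println(countBuilds(1000)(() => defPlan.schema))  // prints 1000
    println(countBuilds(1000)(() => lazyPlan.schema)) // prints 1
  }
}
```

With `def`, the number of schema objects grows with the number of rows; with `lazy val`, all rows share a single instance.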

Author: Cheng Lian <[email protected]>

Closes #5398 from liancheng/spark-6748 and squashes the following commits:

3159469 [Cheng Lian] Makes QueryPlan.schema a lazy val


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/77bcceb9
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/77bcceb9
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/77bcceb9

Branch: refs/heads/master
Commit: 77bcceb9f01e97cb6f41791f2167b40c4311f701
Parents: fc957dc
Author: Cheng Lian <[email protected]>
Authored: Wed Apr 8 07:00:56 2015 +0800
Committer: Cheng Lian <[email protected]>
Committed: Wed Apr 8 07:00:56 2015 +0800

----------------------------------------------------------------------
 .../main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/77bcceb9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala
----------------------------------------------------------------------
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala
index 02f7c26..7967189 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala
@@ -150,7 +150,7 @@ abstract class QueryPlan[PlanType <: TreeNode[PlanType]] extends TreeNode[PlanTy
     }.toSeq
   }
 
-  def schema: StructType = StructType.fromAttributes(output)
+  lazy val schema: StructType = StructType.fromAttributes(output)
 
   /** Returns the output schema in the tree format. */
   def schemaString: String = schema.treeString

