spark git commit: [SPARK-15680][SQL] Disable comments in generated code in order to avoid perf. issues

rxin Tue, 31 May 2016 17:30:55 -0700

Repository: spark
Updated Branches:
  refs/heads/branch-2.0 978f54e76 -> f0e8738c1



[SPARK-15680][SQL] Disable comments in generated code in order to avoid perf. issues

## What changes were proposed in this pull request?

In benchmarks involving tables with very wide and complex schemas (thousands of 
columns, deep nesting), I noticed that significant amounts of time (order of 
tens of seconds per task) were being spent generating comments during the code 
generation phase.

The root cause is that calling toString() on a complex expression can involve 
thousands of string concatenations, resulting in huge amounts (tens of 
gigabytes) of character array allocation and copying.

In the long term, we can avoid this problem by passing StringBuilders down the 
tree and using them to accumulate output. As a short-term workaround, this 
patch guards comment generation behind a flag and disables comments by default 
(for wide tables / complex queries, these comments were being truncated prior 
to display and thus were not very useful).
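
To illustrate the long-term StringBuilder idea, here is a minimal sketch. It is 
not code from this patch: `Expr`, `Leaf`, `Add`, `render`, and `renderTo` are 
hypothetical names and do not correspond to Spark's Catalyst classes. It 
assumes only that rendering a node means concatenating the renderings of its 
children.

```scala
object RenderSketch {
  // Hypothetical mini expression tree (illustrative only, not Spark's Expression API).
  sealed trait Expr {
    // Naive rendering: each level builds a fresh String from its children's Strings,
    // so characters near the leaves are copied once per enclosing level.
    def render: String = this match {
      case Leaf(name) => name
      case Add(l, r)  => "(" + l.render + " + " + r.render + ")"
    }

    // Accumulating rendering: the whole tree appends into one shared builder,
    // so each character is written once (plus amortized buffer growth).
    def renderTo(sb: StringBuilder): Unit = this match {
      case Leaf(name) =>
        sb.append(name)
      case Add(l, r) =>
        sb.append("(")
        l.renderTo(sb)
        sb.append(" + ")
        r.renderTo(sb)
        sb.append(")")
    }
  }
  final case class Leaf(name: String) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr

  // Convenience wrapper that owns the builder.
  def renderWithBuilder(e: Expr): String = {
    val sb = new StringBuilder
    e.renderTo(sb)
    sb.toString
  }

  // Builds a left-deep chain such as ((c0 + c1) + c2), the worst case for `render`.
  def leftDeep(n: Int): Expr =
    (1 to n).foldLeft(Leaf("c0"): Expr)((acc, i) => Add(acc, Leaf(s"c$i")))

  def main(args: Array[String]): Unit = {
    val e = leftDeep(1000)
    // Both strategies produce the same text; only the amount of copying differs.
    assert(renderWithBuilder(e) == e.render)
  }
}
```

For the left-deep chain, the naive version copies the innermost text once per 
level, so total copying grows roughly quadratically with depth, while the 
builder version stays roughly linear in the output size.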

## How was this patch tested?

This was tested manually by running a Spark SQL query over an empty table with 
a very wide schema obtained from a real workload. Disabling comments brought 
the per-task time down from about 16 seconds to 600 milliseconds.

Author: Josh Rosen <[email protected]>

Closes #13421 from JoshRosen/disable-line-comments-in-codegen.

(cherry picked from commit 8ca01a6feb4935b1a3815cfbff1b90ccc6f60984)
Signed-off-by: Reynold Xin <[email protected]>


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f0e8738c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f0e8738c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f0e8738c

Branch: refs/heads/branch-2.0
Commit: f0e8738c1ec0e4c5526aeada6f50cf76428f9afd
Parents: 978f54e
Author: Josh Rosen <[email protected]>
Authored: Tue May 31 17:30:03 2016 -0700
Committer: Reynold Xin <[email protected]>
Committed: Tue May 31 17:30:13 2016 -0700

----------------------------------------------------------------------
 .../expressions/codegen/CodeGenerator.scala     | 23 ++++++++++++++------
 1 file changed, 16 insertions(+), 7 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/f0e8738c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
----------------------------------------------------------------------
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
index 93e477e..9657f26 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
@@ -24,6 +24,7 @@ import com.google.common.cache.{CacheBuilder, CacheLoader}
 import org.codehaus.janino.ClassBodyEvaluator
 import scala.language.existentials
 
+import org.apache.spark.SparkEnv
 import org.apache.spark.internal.Logging
 import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.catalyst.expressions._
@@ -724,15 +725,23 @@ class CodegenContext {
   /**
    * Register a comment and return the corresponding place holder
    */
-  def registerComment(text: String): String = {
-    val name = freshName("c")
-    val comment = if (text.contains("\n") || text.contains("\r")) {
-      text.split("(\r\n)|\r|\n").mkString("/**\n * ", "\n * ", "\n */")
+  def registerComment(text: => String): String = {
+    // By default, disable comments in generated code because computing the comments themselves can
+    // be extremely expensive in certain cases, such as deeply-nested expressions which operate over
+    // inputs with wide schemas. For more details on the performance issues that motivated this
+    // flag, see SPARK-15680.
+    if (SparkEnv.get != null && SparkEnv.get.conf.getBoolean("spark.sql.codegen.comments", false)) {
+      val name = freshName("c")
+      val comment = if (text.contains("\n") || text.contains("\r")) {
+        text.split("(\r\n)|\r|\n").mkString("/**\n * ", "\n * ", "\n */")
+      } else {
+        s"// $text"
+      }
+      placeHolderToComments += (name -> comment)
+      s"/*$name*/"
     } else {
-      s"// $text"
+      ""
     }
-    placeHolderToComments += (name -> comment)
-    s"/*$name*/"
   }
 }
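
The change from `text: String` to `text: => String` in the diff above is what 
lets the guard skip the expensive work: a by-name argument is not evaluated 
unless the method body references it. The following is a minimal, 
self-contained sketch of that behaviour. `CommentRegistry` and its 
`commentsEnabled` constructor parameter are hypothetical stand-ins for 
`CodegenContext` and the `SparkEnv` config lookup, and the sketch handles only 
single-line comments.

```scala
import scala.collection.mutable

object LazyCommentSketch {
  // Hypothetical stand-in for CodegenContext; the Boolean replaces the
  // SparkEnv.get.conf.getBoolean("spark.sql.codegen.comments", false) lookup.
  final class CommentRegistry(commentsEnabled: Boolean) {
    private var counter = 0
    val placeHolderToComments: mutable.Map[String, String] = mutable.Map.empty

    // `text` is by-name: the caller's expression runs only when referenced here.
    def registerComment(text: => String): String = {
      if (commentsEnabled) {
        // A by-name argument is re-evaluated on every reference, so this
        // sketch binds it to a local val to evaluate it exactly once.
        val evaluated = text
        counter += 1
        val name = s"c$counter"
        placeHolderToComments(name) = s"// $evaluated"
        s"/*$name*/"
      } else {
        ""
      }
    }
  }

  def main(args: Array[String]): Unit = {
    var evaluations = 0
    // Stands in for an expensive Expression.toString call.
    def expensiveToString(): String = { evaluations += 1; "a + b" }

    val disabled = new CommentRegistry(commentsEnabled = false)
    assert(disabled.registerComment(expensiveToString()) == "")
    assert(evaluations == 0) // the argument was never computed

    val enabled = new CommentRegistry(commentsEnabled = true)
    assert(enabled.registerComment(expensiveToString()) == "/*c1*/")
    assert(evaluations == 1) // computed once, thanks to the local val
  }
}
```

With a strict `String` parameter, the caller would already have paid for 
`toString()` before the flag was checked, so the guard alone would not have 
removed the cost.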
 


