GitHub user JoshRosen opened a pull request:

    https://github.com/apache/spark/pull/13421

    [SPARK-15680][SQL] Disable comments in generated code in order to avoid perf. issues

    ## What changes were proposed in this pull request?
    
    In benchmarks involving tables with very wide and complex schemas 
(thousands of columns, deep nesting), I noticed that a significant amount of 
time (on the order of tens of seconds per task) was being spent generating 
comments during the code generation phase.
    
    The root cause is that calling toString() on a complex expression can 
involve thousands of string concatenations, resulting in huge amounts (tens of 
gigabytes) of character array allocation and copying.
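
To make the allocation cost concrete, here is a minimal Java sketch of a toy expression tree. It is an illustration only and does not use Spark's actual expression classes: `ConcatCost`, `Leaf`, `Add`, `chain`, `textLength`, and `charsMaterialized` are all hypothetical names. Each `toString()` call on an inner node returns a new string containing the full text of its children, so deeply nested trees copy the same characters many times.

```java
// Hypothetical mini expression tree for illustration; these are not Spark's classes.
final class ConcatCost {
    interface Expr {}

    static final class Leaf implements Expr {
        final String name;
        Leaf(String name) { this.name = name; }
        @Override public String toString() { return name; }
    }

    static final class Add implements Expr {
        final Expr left;
        final Expr right;
        Add(Expr left, Expr right) { this.left = left; this.right = right; }
        // Every call allocates a fresh String and re-copies the full text of both children.
        @Override public String toString() { return "(" + left + " + " + right + ")"; }
    }

    // Builds a left-deep chain such as ((c0 + c1) + c2) for n = 2.
    static Expr chain(int n) {
        Expr e = new Leaf("c0");
        for (int i = 1; i <= n; i++) {
            e = new Add(e, new Leaf("c" + i));
        }
        return e;
    }

    // Length of the text that toString() would return for this node.
    static long textLength(Expr e) {
        if (e instanceof Add) {
            Add a = (Add) e;
            // "(" + " + " + ")" contributes 5 characters around the children.
            return 5 + textLength(a.left) + textLength(a.right);
        }
        return ((Leaf) e).name.length();
    }

    // Sum of the lengths of the strings returned by every node in the tree
    // during a single toString() call on the root.
    static long charsMaterialized(Expr e) {
        if (e instanceof Add) {
            Add a = (Add) e;
            return textLength(e) + charsMaterialized(a.left) + charsMaterialized(a.right);
        }
        return textLength(e);
    }
}
```

In this toy model, a left-deep chain makes the total number of characters held in intermediate strings grow roughly quadratically with the depth of the tree, even though the final string grows only linearly.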
    
    In the long term, we can avoid this problem by passing StringBuilders down 
the tree and using them to accumulate output. As a short-term workaround, this 
patch guards comment generation behind a flag and disables comments by default 
(for wide tables / complex queries, these comments were being truncated prior 
to display and thus were not very useful).
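
Both ideas can be sketched in a few lines of Java. This is a hedged illustration under stated assumptions, not the code in this patch: `CommentSketch`, `Node`, `appendTo`, `render`, `comment`, and the `commentsEnabled` parameter are hypothetical names, and the sketch does not escape comment terminators inside the text. `appendTo` shows the long-term approach of accumulating output in one shared StringBuilder, while `comment` shows the short-term guard, which takes the text lazily so that nothing is rendered when comments are disabled.

```java
import java.util.function.Supplier;

// Hypothetical sketch; names and structure are assumptions, not Spark's API.
final class CommentSketch {
    interface Node {
        // Appends this node's text to a shared builder instead of returning a new String.
        void appendTo(StringBuilder out);
    }

    static final class Leaf implements Node {
        final String name;
        Leaf(String name) { this.name = name; }
        @Override public void appendTo(StringBuilder out) { out.append(name); }
    }

    static final class Add implements Node {
        final Node left;
        final Node right;
        Add(Node left, Node right) { this.left = left; this.right = right; }
        @Override public void appendTo(StringBuilder out) {
            out.append('(');
            left.appendTo(out);
            out.append(" + ");
            right.appendTo(out);
            out.append(')');
        }
    }

    // Long-term idea: a single builder is passed down the tree, so each
    // character is appended once rather than re-copied at every level.
    static String render(Node root) {
        StringBuilder out = new StringBuilder();
        root.appendTo(out);
        return out.toString();
    }

    // Short-term workaround: the comment text is supplied lazily and is
    // never computed when the (hypothetical) flag is off.
    static String comment(boolean commentsEnabled, Supplier<String> text) {
        if (!commentsEnabled) {
            return "";
        }
        return "/* " + text.get() + " */";
    }
}
```

A caller would write something like `CommentSketch.comment(enabled, () -> CommentSketch.render(expr))`, so the expensive rendering is skipped entirely when `enabled` is false.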
    
    ## How was this patch tested?
    
    This was tested manually by running a Spark SQL query over an empty table 
with a very wide schema obtained from a real workload. Disabling comments 
brought the per-task time down from about 16 seconds to 600 milliseconds.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JoshRosen/spark disable-line-comments-in-codegen

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13421.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13421
    
----
commit 0b6a190169ed0b16558c7d5fd3ba365a1b6795b9
Author: Josh Rosen <[email protected]>
Date:   2016-05-31T20:20:08Z

    Use flag to disable comments in generated code.

----


