GitHub user JoshRosen opened a pull request:
https://github.com/apache/spark/pull/13421
[SPARK-15680][SQL] Disable comments in generated code in order to avoid
perf. issues
## What changes were proposed in this pull request?
In benchmarks involving tables with very wide and complex schemas
(thousands of columns, deep nesting), I noticed that a significant amount of
time (on the order of tens of seconds per task) was being spent generating
comments during the code generation phase.
The root cause is that calling toString() on a complex expression can
involve thousands of string concatenations, resulting in huge amounts
(tens of gigabytes) of character array allocation and copying.
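
To illustrate why nested string building gets expensive, here is a minimal sketch. The `Node` class, `chain`, and `render` are hypothetical stand-ins invented for this example, not Spark's expression classes; `render` mimics a concatenation-based toString() and counts the characters in every intermediate string it produces.

```java
import java.util.ArrayList;
import java.util.List;

class ConcatCostSketch {
    /** Hypothetical expression-tree node (assumption; not a Spark class). */
    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();
        Node(String name) { this.name = name; }
    }

    /** Builds a chain of the given depth, e.g. depth 2 gives f(f(leaf())). */
    static Node chain(int depth) {
        Node root = new Node("leaf");
        for (int i = 0; i < depth; i++) {
            Node parent = new Node("f");
            parent.children.add(root);
            root = parent;
        }
        return root;
    }

    /**
     * Renders a subtree the way a concatenation-based toString() would:
     * every node builds a fresh String that embeds its children's strings,
     * so text from deep nodes is copied again at each ancestor.
     * charsBuilt[0] accumulates the length of each per-node result.
     */
    static String render(Node n, long[] charsBuilt) {
        String s = n.name + "(";
        for (int i = 0; i < n.children.size(); i++) {
            if (i > 0) s += ", ";
            s += render(n.children.get(i), charsBuilt);
        }
        s += ")";
        charsBuilt[0] += s.length();
        return s;
    }
}
```

In this sketch the final string grows linearly with the chain depth, while the counted total grows quadratically: a depth-100 chain yields a 306-character string but 15,756 characters across the per-node results. The count is a lower bound, since each `+=` also allocates a temporary string.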
In the long term, we can avoid this problem by passing StringBuilders down
the tree and using them to accumulate output. As a short-term workaround, this
patch guards comment generation behind a flag and disables comments by default
(for wide tables / complex queries, these comments were being truncated prior
to display and thus were not very useful).
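
Both ideas in the paragraph above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this patch: `Node`, `appendTo`, `commentsEnabled`, and `comment` are hypothetical names. The sketch passes one shared StringBuilder down the tree, and takes the comment text as a `Supplier` so that no string is built at all when the flag is off.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

class CodegenCommentSketch {
    /** Hypothetical expression-tree node (assumption; not a Spark class). */
    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();
        Node(String name) { this.name = name; }

        // Long-term idea: write the subtree into a shared buffer, so each
        // character is appended once (aside from amortized buffer growth).
        void appendTo(StringBuilder out) {
            out.append(name).append('(');
            for (int i = 0; i < children.size(); i++) {
                if (i > 0) out.append(", ");
                children.get(i).appendTo(out);
            }
            out.append(')');
        }
    }

    // Hypothetical flag, off by default as described in the text above.
    static boolean commentsEnabled = false;

    /**
     * Short-term idea: guard comment generation behind the flag.
     * The supplier is only invoked when comments are enabled, so the
     * expensive string is never built on the default path.
     */
    static String comment(Supplier<String> text) {
        if (!commentsEnabled) return "";
        return "/* " + text.get() + " */";
    }
}
```

A caller would write something like `comment(() -> expr.toString())`; with the flag off, the lambda body never runs.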
## How was this patch tested?
This was tested manually by running a Spark SQL query over an empty table
with a very wide schema obtained from a real workload. Disabling comments
brought the per-task time down from about 16 seconds to 600 milliseconds.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/JoshRosen/spark disable-line-comments-in-codegen
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/13421.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #13421
----
commit 0b6a190169ed0b16558c7d5fd3ba365a1b6795b9
Author: Josh Rosen <[email protected]>
Date: 2016-05-31T20:20:08Z
Use flag to disable comments in generated code.
----