GitHub user viirya opened a pull request:
https://github.com/apache/spark/pull/17770
[SPARK-20392][SQL][WIP] Set barrier to prevent re-entering a tree
## What changes were proposed in this pull request?
A performance degradation has been reported when applying an ML
pipeline to a dataset with many columns but few rows.
I currently think the degradation is caused by the cost of switching
between the DataFrame/Dataset abstraction and logical plans. Some
DataFrame operations (e.g., `def select`) go back and forth between the
DataFrame abstraction and logical plans. In typical SQL usage this cost is
negligible. However, it is not rare to chain dozens of pipeline stages in ML.
As the query plan grows incrementally while those stages run, the cost of
this switching grows too. In particular, the `Analyzer` traverses the whole
query plan even though most of it has already been analyzed.
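The growth described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `Plan`, `Relation`, `Project`, `Barrier`, and `BarrierSketch` are hypothetical names invented for illustration and are not Spark's classes or the code in this patch. It counts how many plan nodes a full traversal touches when each pipeline stage adds one projection, with and without a barrier node around the already-analyzed child.

```scala
// Toy plan tree. These types are illustrative only, not Spark's LogicalPlan.
sealed trait Plan
case class Relation(name: String) extends Plan
case class Project(columns: Seq[String], child: Plan) extends Plan
// Hypothetical marker node: the wrapped subtree is already analyzed.
case class Barrier(child: Plan) extends Plan

object BarrierSketch {
  // Number of nodes a traversal visits; it does not descend below a Barrier.
  def visit(plan: Plan): Int = plan match {
    case Relation(_)       => 1
    case Project(_, child) => 1 + visit(child)
    case Barrier(_)        => 1
  }

  // Simulate `stages` pipeline stages. Each stage adds one projection on top
  // of the previous plan and then traverses the new plan once.
  def totalVisits(stages: Int, useBarrier: Boolean): Int = {
    var plan: Plan = Relation("t")
    var total = 0
    for (i <- 1 to stages) {
      val child = if (useBarrier) Barrier(plan) else plan
      plan = Project(Seq(s"c$i"), child)
      total += visit(plan)
    }
    total
  }
}
```

In this model the total work grows quadratically with the number of stages when every traversal walks the full plan, and linearly when the traversal stops at the barrier. The real `Analyzer` applies many rules per pass, so the numbers are not comparable to the timings in this description.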
By eliminating part of this cost, the time to run the example code locally
is reduced from about 1 min to about 30 secs.
In particular, the time spent applying the pipeline locally goes mostly to
calling `transform` on the 137 `Bucketizer`s. Before the change, each call to
a `Bucketizer`'s `transform` costs about 0.4 sec, so the total time spent in
all `Bucketizer` `transform` calls is about 50 secs. After the change, each
call costs only about 0.1 sec.
We also make `boundEnc` a lazy variable to reduce unnecessary running time.
Note: the code and datasets provided by Barry Becker to reproduce this
issue can be found on the JIRA.
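As a general illustration of why a lazy variable can save time, here is a minimal sketch, assuming the value is expensive to build and not always needed. `LazyValSketch`, `Holder`, `bound`, and `builds` are hypothetical names and do not reflect the actual definition of `boundEnc`.

```scala
object LazyValSketch {
  // Counts how many times the expensive value is actually constructed.
  var builds = 0

  // Hypothetical holder of an expensive derived value.
  class Holder {
    // The initializer runs on first access only, and its result is cached.
    lazy val bound: String = {
      builds += 1
      "expensive-value"
    }
  }
}
```

Creating a `Holder` does not run the initializer. The first read of `bound` runs it once, and later reads reuse the cached result.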
## How was this patch tested?
Existing tests.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/viirya/spark-1 SPARK-20392
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/17770.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #17770
----
commit fe4483240d209fa6b7267e521fd81231462475de
Author: Liang-Chi Hsieh <[email protected]>
Date: 2017-04-26T08:53:51Z
Set barrier to prevent re-analysis of analyzed plan.
----