GitHub user MickDavies reopened a pull request:
https://github.com/apache/spark/pull/6673
[SPARK-8077][SQL] Optimization for TreeNodes with large numbers of children
For example large IN clauses
Large IN clauses are parsed very slowly. For example, the SQL below (10K items
in the IN list) takes 45-50s to parse.
s"""SELECT * FROM Person WHERE ForeName IN ('${(1 to 10000).map("n" + _).mkString("','")}')"""
This is principally due to TreeNode, which repeatedly calls contains on
children, where children in this case is a List that is 10K long. In effect,
parsing large IN clauses is O(N squared).
Using a lazily initialised Set built from children for the contains checks
reduces parse time to around 2.5s.
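To illustrate the idea, here is a minimal sketch of the difference between scanning a List and looking up a lazily built Set. It is not the code from this pull request: Node, isChildSlow, isChildFast and Demo are hypothetical names made up for the example, and the real TreeNode is considerably more involved.

```scala
// Hypothetical, simplified stand-in for a tree node; not Spark's actual TreeNode.
final case class Node(name: String, children: Seq[Node] = Nil) {
  // Built on first access only, so nodes that never test membership pay nothing.
  lazy val childSet: Set[Node] = children.toSet

  // Linear scan of the children: O(N) per call when children is a List.
  def isChildSlow(n: Node): Boolean = children.contains(n)

  // Hash lookup: expected O(1) per call after the one-off O(N) Set build.
  def isChildFast(n: Node): Boolean = childSet.contains(n)
}

object Demo {
  // 10K leaf nodes, mirroring the 10K items of the IN clause above.
  val items: List[Node] = (1 to 10000).map(i => Node("n" + i)).toList
  val in: Node = Node("IN", items)

  // One membership test per child:
  // N linear scans (O(N squared) overall) versus N hash lookups (O(N) overall).
  val slowCount: Int = items.count(in.isChildSlow)
  val fastCount: Int = items.count(in.isChildFast)
}
```

Both counts come out the same; only the cost differs. The lazy val matters because most nodes have few children and may never need the Set at all.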
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/MickDavies/spark SPARK-8077
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/6673.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #6673
----
commit e6be8beb72936bb457343e6c9bd0dfddeede040f
Author: Michael Davies <[email protected]>
Date: 2015-06-05T18:02:15Z
SPARK-8077: Optimization for TreeNodes with large numbers of children
For example large IN clauses
----