GitHub user mateiz opened a pull request:
https://github.com/apache/spark/pull/1990
[SPARK-3084] [SQL] Collect broadcasted tables in parallel in joins
BroadcastHashJoin has a broadcastFuture variable that tries to collect
the broadcasted table in a separate thread, but this doesn't help
because it's a lazy val that only gets initialized when you attempt to
build the RDD. Thus queries that broadcast multiple tables would collect
and broadcast them sequentially. I changed this to a val to let it start
collecting right when the operator is created.
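
The difference between the two declarations can be sketched outside Spark. This is a minimal illustration, not the actual BroadcastHashJoin code: LazyOperator, EagerOperator, collectMillis, and the execute() method are hypothetical stand-ins, and Thread.sleep simulates the time it takes to collect a table.

```scala
import scala.concurrent.{Await, Future, blocking}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical stand-in for an operator whose future is a lazy val:
// the background collect does not start until broadcastFuture is first read.
class LazyOperator(name: String, collectMillis: Long) {
  lazy val broadcastFuture: Future[String] = Future {
    blocking { Thread.sleep(collectMillis) } // simulated collect
    name
  }
  def execute(): String = Await.result(broadcastFuture, 10.seconds)
}

// Same operator with a plain val: the future is submitted in the constructor,
// so the collect is already running by the time execute() is called.
class EagerOperator(name: String, collectMillis: Long) {
  val broadcastFuture: Future[String] = Future {
    blocking { Thread.sleep(collectMillis) } // simulated collect
    name
  }
  def execute(): String = Await.result(broadcastFuture, 10.seconds)
}

object BroadcastFutureDemo {
  // Wall-clock time of a block, in milliseconds.
  def timeMillis[A](body: => A): Long = {
    val start = System.nanoTime()
    body
    (System.nanoTime() - start) / 1000000
  }

  def main(args: Array[String]): Unit = {
    // Both operators are created first, then executed one after another,
    // mirroring a plan that contains two broadcast joins.
    val lazyTime = timeMillis {
      val ops = Seq(new LazyOperator("a", 300), new LazyOperator("b", 300))
      ops.foreach(_.execute())
    }
    val eagerTime = timeMillis {
      val ops = Seq(new EagerOperator("a", 300), new EagerOperator("b", 300))
      ops.foreach(_.execute())
    }
    println(s"lazy val: $lazyTime ms, val: $eagerTime ms")
  }
}
```

With the lazy val, the second collect cannot begin until the first execute() has returned, so the two simulated collects run back to back. With the plain val, both futures are submitted when the operators are constructed and the waits overlap, so the total should be closer to a single collect.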
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/mateiz/spark spark-3084
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/1990.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1990
----
commit f468766e2051f323ed81ecc53c27bed7becdc9b1
Author: Matei Zaharia <[email protected]>
Date: 2014-08-17T02:09:34Z
[SPARK-3084] Collect broadcasted tables in parallel in joins
----