[ 
https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15518598#comment-15518598
 ] 

Yan commented on SPARK-17556:
-----------------------------

A few comments:

1) The "one-executor collection" approach differs from driver-side collection 
and broadcasting in that it avoids uploading data from the driver back to the 
cluster. Its primary concern, as pointed out, is that the sole executor could 
become a bottleneck, to a large degree reproducing the latency issue of the 
"driver-side collection" approach.
2) The "all-executor collection" approach is more balanced and scalable, but it 
might suffer from a network storm, since every slave needs to connect to all 
the others.
3) The real issue is the trade-off between the repeated, and thus wasted, work 
when multiple collectors/broadcasters each collect pieces of the broadcast 
data, and the extended latency when the collection/broadcasting is performed 
once and for all. This is not much different from the choice between multiple 
reducers and a single reducer in a map/reduce execution. The final output of a 
single reducer is ready to use, while the outputs of multiple reducers require 
final assembly by the end users, particularly if the final result is to be 
organized, e.g., totally ordered. But using multiple reducers is more 
scalable, more balanced, and likely faster.
4) It's probably good to have a configurable number of executors acting as 
collectors/broadcasters, each of which collects and broadcasts just a portion 
of the broadcast table for the final join executions.
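
To make the trade-off in points 1), 2) and 4) concrete, here is a rough sketch 
in plain Python, not Spark code. The function names, the round-robin split, and 
the connection-count model are all assumptions made for illustration; the model 
simply assumes that collectors are themselves executors and that every executor 
fetches one portion from every collector other than itself.

```python
def assign_partitions(num_partitions, num_collectors):
    """Hypothetical round-robin split of the broadcast table's partitions
    across a configurable number of collector/broadcaster executors."""
    if num_collectors < 1:
        raise ValueError("need at least one collector")
    return {
        c: list(range(c, num_partitions, num_collectors))
        for c in range(num_collectors)
    }


def fetch_connections(num_executors, num_collectors):
    """Assumed model of the broadcast step: each executor pulls one portion
    from every collector, and a collector does not connect to itself."""
    if not 1 <= num_collectors <= num_executors:
        raise ValueError("collectors must be between 1 and num_executors")
    return num_executors * num_collectors - num_collectors


if __name__ == "__main__":
    executors = 10
    for collectors in (1, 3, executors):
        print(
            collectors,
            "collector(s):",
            fetch_connections(executors, collectors),
            "connections,",
            assign_partitions(8, collectors),
        )
```

Under this model, one collector corresponds to the "one-executor collection" 
case (few connections, but a single executor holds every partition), while 
setting the number of collectors equal to the number of executors corresponds 
to the "all-executor collection" case, where every executor connects to all 
the others. Values in between trade one concern against the other.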

> Executor side broadcast for broadcast joins
> -------------------------------------------
>
>                 Key: SPARK-17556
>                 URL: https://issues.apache.org/jira/browse/SPARK-17556
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core, SQL
>            Reporter: Reynold Xin
>         Attachments: executor broadcast.pdf
>
>
> Currently in Spark SQL, in order to perform a broadcast join, the driver must 
> collect the result of an RDD and then broadcast it. This introduces some 
> extra latency. It might be possible to broadcast directly from executors.


