[
https://issues.apache.org/jira/browse/TEZ-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011041#comment-14011041
]
Rajesh Balamohan commented on TEZ-1152:
---------------------------------------
Since #1 and #2 need to be fixed independently, I will create sub-tasks for them.
> Optimize broadcast join for scalability
> ---------------------------------------
>
> Key: TEZ-1152
> URL: https://issues.apache.org/jira/browse/TEZ-1152
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Rajesh Balamohan
> Labels: performance, scalability
>
> Two main issues affect large queries that use broadcast shuffle:
> 1. Many tasks contact the same node to download shuffle data, so a
> single machine is overloaded with requests most of the time.
> 2. Tasks belonging to the same job (on the same machine) download
> broadcast shuffle data redundantly. If the data can be copied to temp
> storage or ramfs, other tasks running on the same machine can refer to
> the local copy. Optimizing this would help when running multiple queries
> in parallel in the cluster.
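
For illustration only, the local-copy idea in item 2 could look roughly like the sketch below. It is not Tez code: the class name LocalBroadcastCache, the fetch_remote callback, and the cache directory are all hypothetical, and threads in one process stand in for tasks on one machine. Real tasks run in separate containers, so an actual fix would probably need a cross-process lock (for example a lock file) rather than threading.Lock.

```python
import os
import tempfile
import threading


class LocalBroadcastCache:
    """Hypothetical per-node cache: the first task that asks for a broadcast
    output downloads it; later tasks on the same machine reuse the local file."""

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)
        self._locks = {}                 # one lock per broadcast output key
        self._guard = threading.Lock()   # protects the _locks dictionary

    def _lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def get(self, key, fetch_remote):
        """Return the path of the local copy, downloading it at most once.

        fetch_remote(key) is an assumed callback that returns the bytes of
        the broadcast output from the remote node.
        """
        path = os.path.join(self.cache_dir, key)
        with self._lock_for(key):
            if not os.path.exists(path):
                # Write to a temp file first, then rename, so a reader
                # never observes a partially written copy.
                fd, tmp_path = tempfile.mkstemp(dir=self.cache_dir)
                with os.fdopen(fd, "wb") as out:
                    out.write(fetch_remote(key))
                os.replace(tmp_path, path)
        return path
```

With this shape, only the first caller per key on a node pays for the remote download, which would also reduce the number of requests hitting the source node described in item 1.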
--
This message was sent by Atlassian JIRA
(v6.2#6252)