Rajesh Balamohan created TEZ-1152:
-------------------------------------
Summary: Optimize broadcast join for scalability
Key: TEZ-1152
URL: https://issues.apache.org/jira/browse/TEZ-1152
Project: Apache Tez
Issue Type: Bug
Reporter: Rajesh Balamohan
Two main issues for large queries using broadcast shuffle
1. Lots of tasks communicate to same node for downloading shuffle data. So most
of the time, single machine will be overloaded with requests.
2. Tasks pertaining to same job (in the same machine) downloads broadcast
shuffle data redundantly. If the data can be copied to temp storage or ramfs,
other tasks running in the same machine can refer to the local copy.
Optimizing this would help when running multiple queries in parallel in the
cluster.
--
This message was sent by Atlassian JIRA
(v6.2#6252)