Rajesh Balamohan created TEZ-1152:
-------------------------------------

             Summary: Optimize broadcast join for scalability
                 Key: TEZ-1152
                 URL: https://issues.apache.org/jira/browse/TEZ-1152
             Project: Apache Tez
          Issue Type: Bug
            Reporter: Rajesh Balamohan


Two main issues for large queries using broadcast shuffle

1. Lots of tasks communicate to same node for downloading shuffle data. So most 
of the time, single machine will be overloaded with requests.

2. Tasks pertaining to same job (in the same machine) downloads broadcast 
shuffle data redundantly.  If the data can be copied to temp storage or ramfs, 
other tasks running in the same machine can refer to the local copy.  
Optimizing this would help when running multiple queries in parallel in the 
cluster.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to