[
https://issues.apache.org/jira/browse/TEZ-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011543#comment-14011543
]
Bikas Saha commented on TEZ-1157:
---------------------------------
To be clear, this is related to but not entirely a broadcast problem. Broadcast
by itself means that the output will be accessible to all consumers. It does
not mean that the output will be read in its entirety by all consumers. In some
(maybe most) cases, the output will be read in its entirety. This is the
behavior of unsorted-kv-input when used in a broadcast edge. It reads all the
data from the producer, which is bad when it is on a broadcast edge with
multiple consumers that run concurrently.
> Optimize broadcast :- Tasks pertaining to same job in same machine should not
> download multiple copies of broadcast data
> ------------------------------------------------------------------------------------------------------------------------
>
> Key: TEZ-1157
> URL: https://issues.apache.org/jira/browse/TEZ-1157
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Labels: performance
> Attachments: TEZ-1152.WIP.patch
>
>
> Currently, tasks belonging to the same job and running on the same machine
> each download their own copy of broadcast data. An optimization would be to
> download one copy per machine and have the rest of the tasks refer to it.
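
The idea in the description (download once per machine, let the other tasks
reuse that copy) could look roughly like the sketch below. This is not the
attached patch and not a Tez API: the class SharedBroadcastCache, the method
fetchOnce, the cache directory, and the key naming are all hypothetical. It
assumes the tasks can see a common local directory and that the key is a valid
file name. The lock only coordinates tasks inside one JVM; tasks in separate
JVMs would need extra coordination (for example a file lock), which is left
out here.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Hypothetical helper for illustration only; not part of Tez. */
final class SharedBroadcastCache {
    // One lock object per key, so tasks in this JVM do not race on the same download.
    private static final ConcurrentHashMap<String, Object> LOCKS = new ConcurrentHashMap<>();

    private SharedBroadcastCache() {}

    /**
     * Returns the local path of the broadcast data for the given key.
     * The downloader runs only if no local copy exists yet.
     */
    static Path fetchOnce(Path cacheDir, String key, Supplier<byte[]> downloader)
            throws IOException {
        Path target = cacheDir.resolve(key);
        Object lock = LOCKS.computeIfAbsent(key, k -> new Object());
        synchronized (lock) {
            if (Files.exists(target)) {
                return target; // an earlier task already fetched this output
            }
            Files.createDirectories(cacheDir);
            // Write to a temp file first so readers never see a partial copy.
            Path tmp = Files.createTempFile(cacheDir, key, ".tmp");
            try {
                Files.write(tmp, downloader.get());
                Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
            } finally {
                Files.deleteIfExists(tmp); // no-op once the move has succeeded
            }
            return target;
        }
    }
}
```

With this shape, the first task to ask for a given producer output pays for the
download, and later tasks on the same machine get the path of the existing
file. Note that this only helps consumers that read the whole output; as the
comment above points out, a broadcast edge does not by itself guarantee that.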