[ https://issues.apache.org/jira/browse/PIG-5024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15472907#comment-15472907 ]
liyunzhang_intel commented on PIG-5024:
---------------------------------------

[~kexianda]: LGTM except for some code style problems.

> add a physical operator to broadcast small RDDs
> -----------------------------------------------
>
>                 Key: PIG-5024
>                 URL: https://issues.apache.org/jira/browse/PIG-5024
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: Xianda Ke
>            Assignee: Xianda Ke
>             Fix For: spark-branch
>
>         Attachments: PIG-5024.patch, PIG-5024_2.patch
>
>
> Currently, when optimizing some kinds of JOIN, the index or sampling files
> are saved to HDFS. With the replication factor set to a larger number, HDFS
> serves as a distributed cache.
> Spark's broadcast mechanism is suitable for this. It seems that we can add a
> physical operator to broadcast small RDDs.
> This will benefit the optimization of some specialized joins, such as Skewed
> Join, Replicated Join and so on.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
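
To illustrate the idea in the issue description, here is a minimal sketch of a replicated (map-side) join in plain Python. It is not Pig or Spark code and does not reflect the attached patch: the function names, the tuple layout, and the sample data are all hypothetical. The dict stands in for what a Spark broadcast variable would deliver to each executor, and each list in `large_partitions` stands in for one partition of the large RDD.

```python
from collections import defaultdict


def build_broadcast_table(small_relation, key_index=0):
    """Collect the small relation into a hash table keyed on the join key.

    In a real Spark job this table would be shipped to every executor once
    as a broadcast variable; here it is only a local dict (assumption).
    """
    table = defaultdict(list)
    for row in small_relation:
        table[row[key_index]].append(row)
    return dict(table)


def replicated_join(large_partition, broadcast_table, key_index=0):
    """Inner-join one partition of the large relation against the table.

    No shuffle of the large relation is needed: every partition probes
    the same in-memory copy of the small relation.
    """
    for row in large_partition:
        for match in broadcast_table.get(row[key_index], []):
            yield row + match


# Hypothetical sample data: (id, country) is small, (id, name) is large.
small = [(1, "US"), (2, "DE")]
large_partitions = [
    [(1, "alice"), (3, "carol")],
    [(2, "bob"), (1, "dave")],
]

table = build_broadcast_table(small)
joined = [
    out
    for partition in large_partitions
    for out in replicated_join(partition, table)
]
```

The point of the sketch is the data flow: the small side is materialized once and reused by every partition, which is the role the HDFS files with a high replication factor play today.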