> On Dec. 17, 2013, 3:52 p.m., Rohini Palaniswamy wrote:
> > The approach is good when the replicate join is not the first vertex of the
> > DAG (i.e. in the case of MR, the replicate join is part of a reduce). If it is
> > the first vertex of the DAG, we need to compare and see that with this approach
> > the performance does not regress against MR's map-only replicate join
> > using distributed cache. Created PIG-3631 for follow up.
>
> Cheolsoo Park wrote:
>     Thank you Rohini for the review. I totally agree that we should measure
>     performance.
>
>     But even in MR, you have two jobs - a first one that loads a small table
>     and stages it on distributed cache, and a second one that does the join. In
>     Tez, I am replacing the first job with a vertex broadcasting a small table.
>     So the performance difference will be between copying a file to distributed
>     cache vs broadcasting it to a downstream vertex. My assumption is that
>     broadcasting is fast since it doesn't have the sort phase. Of course, I
>     might be wrong.
>
>     I will address your comments below shortly. Thank you!

Oh, I did not know that. I assumed it would be a single map job if the script contained only a direct join. I had seen "pigrepl_scope-xxxx" directories in distributed cache and thought they appeared only when there was a reduce, but now I realize they come from another job even if there is only a join and nothing else. I am also hoping that broadcast is better, but we need to compare and ensure we don't have a performance regression. I created that jira also to see whether we can broadcast the serialized hashtable itself, as Tez is supposed to be datatype agnostic.


- Rohini


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16313/#review30533
-----------------------------------------------------------


On Dec. 17, 2013, 3:51 a.m., Cheolsoo Park wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/16313/
> -----------------------------------------------------------
>
> (Updated Dec. 17, 2013, 3:51 a.m.)
>
>
> Review request for pig, Alex Bain, Daniel Dai, Mark Wagner, and Rohini
> Palaniswamy.
>
>
> Bugs: PIG-3604
>     https://issues.apache.org/jira/browse/PIG-3604
>
>
> Repository: pig-git
>
>
> Description
> -------
>
> Implemented replicated join in Tez as follows:
> - POFRJoinTez extends POFRJoin. The difference between the two is that the
> replication hash table is constructed out of broadcasting edges in Tez
> instead of files on distributed cache in MR.
> - TezCompiler adds a vertex per replicated table and connects it to the
> POFRJoin vertex via a broadcasting edge.
>
> Note that in POLocalRearrangeTez, I package tuples in the same way for
> broadcast and scatter/gather edges, so I removed outputType
> (DataMovementType).
>
>
> Diffs
> -----
>
>   src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POFRJoin.java d7c54d8
>   src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POMergeJoin.java e900751
>   src/org/apache/pig/backend/hadoop/executionengine/tez/POFRJoinTez.java e69de29
>   src/org/apache/pig/backend/hadoop/executionengine/tez/POLocalRearrangeTez.java cda5d89
>   src/org/apache/pig/backend/hadoop/executionengine/tez/PigProcessor.java 7a1736a
>   src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java 2584501
>   test/e2e/pig/tests/tez.conf b280698
>   test/org/apache/pig/test/data/GoldenFiles/TEZC10.gld e69de29
>   test/org/apache/pig/tez/TestTezCompiler.java 79dc94e
>
> Diff: https://reviews.apache.org/r/16313/diff/
>
>
> Testing
> -------
>
> Added a unit test case to TestTezCompiler.
> Added an e2e test case to Join.
>
> ant test-tez passes.
> e2e test passes.
>
>
> Thanks,
>
> Cheolsoo Park
>
>
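
For readers new to the replicated (fragment-replicate) join discussed in this thread, the general technique can be sketched roughly as follows: the small table is loaded into an in-memory hash table keyed by the join key, and the large table is streamed past it. This is only an illustration of the idea, not the code under review. The class and method names (ReplicatedJoinSketch, buildTable, join) are hypothetical, rows are simplified to two-element string arrays, and nothing here reflects how POFRJoin or POFRJoinTez actually read their input from distributed cache or broadcast edges.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; hypothetical names, not Pig's POFRJoin/POFRJoinTez. */
final class ReplicatedJoinSketch {

    /** Build phase: index the small (replicated) relation by its join key (column 0). */
    static Map<String, List<String>> buildTable(List<String[]> small) {
        Map<String, List<String>> table = new HashMap<>();
        for (String[] row : small) {
            table.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        return table;
    }

    /** Probe phase: stream the large relation and emit one output row per match (inner join). */
    static List<String> join(List<String[]> large, Map<String, List<String>> table) {
        List<String> out = new ArrayList<>();
        for (String[] row : large) {
            List<String> matches = table.get(row[0]);
            if (matches == null) {
                continue; // no matching key in the replicated table
            }
            for (String match : matches) {
                out.add(row[0] + "," + row[1] + "," + match);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample data: (id, country) is the small table, (id, name) the large one.
        List<String[]> small = Arrays.asList(
                new String[] {"1", "US"}, new String[] {"2", "KR"});
        List<String[]> large = Arrays.asList(
                new String[] {"1", "alice"}, new String[] {"3", "bob"}, new String[] {"2", "carol"});
        System.out.println(join(large, buildTable(small)));
        // prints [1,alice,US, 2,carol,KR]
    }
}
```

In this sketch the only thing that would change between the MR and Tez variants is where buildTable gets its rows from; the probe side is the same either way, which is why the comparison in this thread comes down to the cost of staging versus broadcasting the small table.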
