> On Dec. 17, 2013, 3:52 p.m., Rohini Palaniswamy wrote:
> > The approach is good when the replicate join is not the first vertex of the 
> > DAG (i.e., in the MR case, the replicate join is part of a reduce). If it is 
> > the first vertex of the DAG, we need to compare and confirm that this 
> > approach does not regress performance relative to MR's map-only replicate 
> > join using the distributed cache. Created PIG-3631 for follow-up.
> 
> Cheolsoo Park wrote:
>     Thank you Rohini for the review. I totally agree that we should measure 
> performance.
>     
>     But even in MR, you have two jobs: a first one that loads the small table 
> and stages it on the distributed cache, and a second one that does the join. 
> In Tez, I am replacing the first job with a vertex that broadcasts the small 
> table. So the performance difference will be between copying a file to the 
> distributed cache and broadcasting it to the downstream vertex. My assumption 
> is that broadcasting is fast since it doesn't have the sort phase. Of course, 
> I might be wrong.
>     
>     I will address your comments below shortly. Thank you!

Oh, I did not know that. I assumed it would run as a single map job if the 
script contained only a direct join. I had seen "pigrepl_scope-xxxx" directories 
in the distributed cache and thought they appeared only when you had a reduce, 
but now I realize they come from another job even if there is only a join and 
nothing else. I am also hoping that broadcast is better, but we just need to 
compare and ensure we don't have a performance regression. I created that JIRA 
also to see if we can broadcast the serialized hashtable itself, as Tez is 
supposed to be datatype agnostic. 
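
For reference, the general idea of a replicated join (build a hash table from 
the small table, then probe it while streaming the large one) can be sketched 
in plain Java as below. This is only an illustration under simplifying 
assumptions: rows are String arrays, the join is an inner join on one column, 
and the names ReplicatedJoinSketch, buildHashTable, and join are made up for 
this example. It is not the POFRJoin or POFRJoinTez code, and it does not show 
how the small table reaches the join (distributed cache in MR, broadcast edge 
in Tez).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Pig or Tez.
final class ReplicatedJoinSketch {

    // Build an in-memory hash table from the small (replicated) relation,
    // keyed on the join column at index keyIdx.
    public static Map<String, List<String[]>> buildHashTable(List<String[]> small, int keyIdx) {
        Map<String, List<String[]>> table = new HashMap<>();
        for (String[] row : small) {
            table.computeIfAbsent(row[keyIdx], k -> new ArrayList<>()).add(row);
        }
        return table;
    }

    // Stream the large relation and probe the hash table. Each match emits
    // the large-side row followed by the small-side row (inner join).
    public static List<String[]> join(List<String[]> large, int keyIdx,
                                      Map<String, List<String[]>> table) {
        List<String[]> out = new ArrayList<>();
        for (String[] row : large) {
            List<String[]> matches = table.get(row[keyIdx]);
            if (matches == null) {
                continue; // no match on the small side, so the row is dropped
            }
            for (String[] m : matches) {
                String[] joined = Arrays.copyOf(row, row.length + m.length);
                System.arraycopy(m, 0, joined, row.length, m.length);
                out.add(joined);
            }
        }
        return out;
    }
}
```

In this sketch, what would be serialized and shipped to the join is either the 
raw rows of the small table or the already built map; the second option is the 
one the "broadcast the serialized hashtable itself" idea refers to.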


- Rohini


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16313/#review30533
-----------------------------------------------------------


On Dec. 17, 2013, 3:51 a.m., Cheolsoo Park wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/16313/
> -----------------------------------------------------------
> 
> (Updated Dec. 17, 2013, 3:51 a.m.)
> 
> 
> Review request for pig, Alex Bain, Daniel Dai, Mark Wagner, and Rohini 
> Palaniswamy.
> 
> 
> Bugs: PIG-3604
>     https://issues.apache.org/jira/browse/PIG-3604
> 
> 
> Repository: pig-git
> 
> 
> Description
> -------
> 
> Implemented replicated join in Tez as follows:
> - POFRJoinTez extends POFRJoin. The difference between the two is that the 
> replication hash table is constructed from broadcast edges in Tez instead 
> of files on the distributed cache in MR.
> - TezCompiler adds a vertex per replicated table and connects it to the 
> POFRJoin vertex via a broadcast edge.
> 
> Note that in POLocalRearrangeTez, I package tuples in the same way for 
> broadcast and scatter/gather edges, so I removed outputType 
> (DataMovementType). 
> 
> 
> Diffs
> -----
> 
>   src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POFRJoin.java d7c54d8 
>   src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POMergeJoin.java e900751 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/POFRJoinTez.java e69de29 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/POLocalRearrangeTez.java cda5d89 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/PigProcessor.java 7a1736a 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java 2584501 
>   test/e2e/pig/tests/tez.conf b280698 
>   test/org/apache/pig/test/data/GoldenFiles/TEZC10.gld e69de29 
>   test/org/apache/pig/tez/TestTezCompiler.java 79dc94e 
> 
> Diff: https://reviews.apache.org/r/16313/diff/
> 
> 
> Testing
> -------
> 
> Added a unit test case to TestTezCompiler.
> Added an e2e test case to Join.
> 
> ant test-tez passes.
> e2e test passes.
> 
> 
> Thanks,
> 
> Cheolsoo Park
> 
>
