[ 
https://issues.apache.org/jira/browse/TEZ-1499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142858#comment-14142858
 ] 

Jeff Zhang commented on TEZ-1499:
---------------------------------

[~bikassaha] I have updated the patch with the following changes:

* bq. Does this need the fix for local mode made in TEZ-1587?
  Fixed.
* Fixed the comment typo.
* Renamed the processors to HashJoinProcessor and SortMergeJoinProcessor and 
added a disclaimer.
* Regarding the difference between these 2 join algorithms, I added comments to 
the class doc and the processor doc. HashJoin requires that the keys in 
HashFile are unique, while SortMergeJoin requires that the keys in both 
datasets are unique. The data generated by JoinDataGen is all unique, so it 
can be used for both join algorithms in the unit tests.
* Changed the pipeline test cases to run in session mode.
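
To illustrate the difference between the two algorithms, here is a minimal sketch in plain Java. It is not the code from the patch and does not use any Tez API; the class name JoinSketch and both method names are hypothetical, and it assumes the keys are unique as described above. The hash join holds one side entirely in memory, while the sort-merge join only keeps the current key of each sorted input.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not the processors from the patch.
final class JoinSketch {

    // Hash join: load the whole "hash" side into memory, then stream the other side.
    // Memory use grows with the size of hashSide.
    static List<String> hashJoin(Iterable<String> streamSide, Iterable<String> hashSide) {
        Set<String> inMemory = new HashSet<>();
        for (String key : hashSide) {
            inMemory.add(key);
        }
        List<String> joined = new ArrayList<>();
        for (String key : streamSide) {
            if (inMemory.contains(key)) {
                joined.add(key);
            }
        }
        return joined;
    }

    // Sort-merge join: assumes both inputs are sorted ascending and keys are unique
    // within each input. Only the current key of each input is held in memory.
    static List<String> sortMergeJoin(Iterator<String> left, Iterator<String> right) {
        List<String> joined = new ArrayList<>();
        if (!left.hasNext() || !right.hasNext()) {
            return joined;
        }
        String l = left.next();
        String r = right.next();
        while (true) {
            int cmp = l.compareTo(r);
            if (cmp == 0) {
                // Keys match: emit and advance both inputs.
                joined.add(l);
                if (!left.hasNext() || !right.hasNext()) {
                    break;
                }
                l = left.next();
                r = right.next();
            } else if (cmp < 0) {
                // Left key is smaller: advance the left input.
                if (!left.hasNext()) {
                    break;
                }
                l = left.next();
            } else {
                // Right key is smaller: advance the right input.
                if (!right.hasNext()) {
                    break;
                }
                r = right.next();
            }
        }
        return joined;
    }
}
```

In this sketch, hashJoin emits matches in the order of the streamed side, while sortMergeJoin emits them in sorted key order.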

> Add SortMergeJoinExample to tez-examples
> ----------------------------------------
>
>                 Key: TEZ-1499
>                 URL: https://issues.apache.org/jira/browse/TEZ-1499
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>         Attachments: Tez-1499-2.patch, Tez-1499-3.patch, Tez-1499.patch
>
>
> In the current join example, the inputs of JoinProcessor are unordered, so 
> it always has to load one input into memory and stream the other. 
> This only works when one dataset is small enough to fit into 
> memory (even without broadcast, memory may not be enough). So I'd like to 
> add another join example in which the inputs of JoinProcessor are ordered 
> (using OrderedPartitionedKVEdgeConfig). This kind of join can be used 
> when both datasets are large.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
