[ https://issues.apache.org/jira/browse/TEZ-1499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142858#comment-14142858 ]
Jeff Zhang commented on TEZ-1499:
---------------------------------
[~bikassaha] I've updated the patch with the following changes:
* bq. Does this need the fix for local mode made in TEZ-1587?
Fixed.
* Fixed the comment typo.
* Renamed the processors to HashJoinProcessor and SortMergeJoinProcessor and
added a disclaimer.
* Regarding the difference between the two join algorithms, I added comments
to the class doc and the processor doc. HashJoin requires that the keys in
HashFile be unique, while SortMergeJoin requires that the keys in both
datasets be unique. The data generated by JoinDataGen is all unique, so it
can be used for both join algorithms in the unit tests.
* Changed the pipeline test cases to run in session mode.
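
To make the difference between the two algorithms concrete, here is a minimal plain-Java sketch. It is not the code from the patch and does not use any Tez API: the class name JoinSketch and both method names are hypothetical, keys are plain strings, and the join only emits the keys present on both sides. The hash join holds one whole side in memory, while the sort-merge join assumes both inputs arrive sorted in ascending order with unique keys and keeps only the current key of each side.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not the processors from the patch.
public class JoinSketch {

    // Hash join: load the small side fully into memory, then stream the other side.
    static List<String> hashJoin(Iterable<String> smallSide, Iterable<String> streamedSide) {
        Set<String> inMemory = new HashSet<>();
        for (String key : smallSide) {
            inMemory.add(key);
        }
        List<String> joined = new ArrayList<>();
        for (String key : streamedSide) {
            if (inMemory.contains(key)) {
                joined.add(key);
            }
        }
        return joined;
    }

    // Sort-merge join: assumes both inputs are sorted ascending and have unique keys.
    // Only the current key of each side is held, so neither side must fit in memory.
    static List<String> sortMergeJoin(Iterator<String> left, Iterator<String> right) {
        List<String> joined = new ArrayList<>();
        if (!left.hasNext() || !right.hasNext()) {
            return joined;
        }
        String l = left.next();
        String r = right.next();
        while (true) {
            int cmp = l.compareTo(r);
            if (cmp == 0) {
                // Match: emit the key and advance both sides.
                joined.add(l);
                if (!left.hasNext() || !right.hasNext()) {
                    break;
                }
                l = left.next();
                r = right.next();
            } else if (cmp < 0) {
                // Left key is smaller: advance the left side.
                if (!left.hasNext()) {
                    break;
                }
                l = left.next();
            } else {
                // Right key is smaller: advance the right side.
                if (!right.hasNext()) {
                    break;
                }
                r = right.next();
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("k1", "k3", "k5", "k7");
        List<String> b = Arrays.asList("k2", "k3", "k4", "k7", "k9");
        System.out.println(hashJoin(a, b));
        System.out.println(sortMergeJoin(a.iterator(), b.iterator()));
    }
}
```

In this sketch the merge loop advances both sides after a match, which is only correct when neither side repeats a key. That is why the uniqueness of the generated test data matters for the sort-merge variant.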
> Add SortMergeJoinExample to tez-examples
> ----------------------------------------
>
> Key: TEZ-1499
> URL: https://issues.apache.org/jira/browse/TEZ-1499
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Jeff Zhang
> Assignee: Jeff Zhang
> Attachments: Tez-1499-2.patch, Tez-1499-3.patch, Tez-1499.patch
>
>
> In the current join example, the inputs of JoinProcessor are unordered, so it
> always needs to load one input into memory and stream the other input.
> This only works when one dataset is small enough to fit into memory (even
> without broadcast, memory may not be enough). So I'd like to add another
> join example in which the inputs of JoinProcessor are ordered (using
> OrderedPartitionedKVEdgeConfig). This kind of join can be used when both
> datasets are large.