[
https://issues.apache.org/jira/browse/TEZ-1499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142742#comment-14142742
]
Bikas Saha commented on TEZ-1499:
---------------------------------
Does this need the fix for local mode made in TEZ-1587?
{code}+ OrderedPartitionedKVEdgeConfig edgeConf =
+ OrderedPartitionedKVEdgeConfig.newBuilder(Text.class.getName(),
+ NullWritable.class.getName(), HashPartitioner.class.getName())
+ .build();{code}
Typo: should this say "the other vertex" for inputVertex2?
{code}+ /**
+ * This vertex represents the one side of the join. It reads text data using
{code}
Change names to HashJoinProcessor and SortMergeJoinProcessor?
Also, can we add a disclaimer to both join processors saying that the join code
has been written as a tutorial for the APIs and not for performance?
If I am reading this correctly, there is a difference between the hashjoin and
sortmergejoin processors. The hash join reads A into a map and outputs values
from B where the B-value exists in the hash map. So if there are multiple
occurrences of the same B-value then all of them will be output. The
sort-merge-join processor seems to be matching the first common occurrence of
A-value and B-value but not other occurrences of B-value. Is that a correct
observation? If so, we should check if JoinDataGen and JoinValidate can break
due to that difference. It's fine for both join processors to behave differently
as long as it's documented, though ideally they should behave the same.
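To make the suspected difference concrete, here is a minimal standalone sketch. It is not the code from the patch. The class and method names are hypothetical, and plain Java lists stand in for the Tez KeyValueReader inputs. It assumes the hash join keeps the A-values in a set, and that the sort-merge join advances both inputs on a match; whether the patch does the latter is exactly what the question above asks.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not the processors from the patch.
final class JoinSemanticsSketch {

    // Hash-join style: load all of A into a set, then emit every B-value
    // that is present in the set. Repeated B-values are all emitted.
    static List<String> hashJoin(List<String> a, List<String> b) {
        Set<String> aValues = new HashSet<>(a);
        List<String> out = new ArrayList<>();
        for (String value : b) {
            if (aValues.contains(value)) {
                out.add(value);
            }
        }
        return out;
    }

    // Merge join over sorted inputs that advances BOTH sides on a match.
    // A repeated B-value is emitted only once per matching A-value.
    static List<String> mergeAdvanceBoth(List<String> sortedA, List<String> sortedB) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < sortedA.size() && j < sortedB.size()) {
            int cmp = sortedA.get(i).compareTo(sortedB.get(j));
            if (cmp == 0) {
                out.add(sortedB.get(j));
                i++;
                j++;
            } else if (cmp < 0) {
                i++;
            } else {
                j++;
            }
        }
        return out;
    }

    // Merge join that advances only B on a match, so every repeated
    // B-value is emitted, matching the hash-join output above.
    static List<String> mergeAdvanceBOnly(List<String> sortedA, List<String> sortedB) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < sortedA.size() && j < sortedB.size()) {
            int cmp = sortedA.get(i).compareTo(sortedB.get(j));
            if (cmp == 0) {
                out.add(sortedB.get(j));
                j++;
            } else if (cmp < 0) {
                i++;
            } else {
                j++;
            }
        }
        return out;
    }
}
```

With A = [a, b, c] and B = [b, b, d], hashJoin and mergeAdvanceBOnly both emit b twice, while mergeAdvanceBoth emits it once. If the patch behaves like mergeAdvanceBoth, duplicate values produced by JoinDataGen would make the two examples disagree.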
Would it make sense for both the pipeline test cases to use session mode to
make the tests run a bit faster?
> Add SortMergeJoinExample to tez-examples
> ----------------------------------------
>
> Key: TEZ-1499
> URL: https://issues.apache.org/jira/browse/TEZ-1499
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Jeff Zhang
> Assignee: Jeff Zhang
> Attachments: Tez-1499-2.patch, Tez-1499.patch
>
>
> In the current join example, the inputs of JoinProcessor are unordered, so
> it always needs to load one input into memory and stream the other input.
> This only fits the case where one dataset is small enough to fit into
> memory (even without broadcast, memory may not be enough). So I'd like to
> add another join example that makes the inputs of JoinProcessor ordered
> (using OrderedPartitionedKVEdgeConfig). This kind of join could be used
> when both of the datasets are large.