[jira] [Commented] (TEZ-1499) Add SortMergeJoinExample to tez-examples

Bikas Saha (JIRA) Sun, 21 Sep 2014 15:16:52 -0700

    [ https://issues.apache.org/jira/browse/TEZ-1499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142742#comment-14142742 ]


Bikas Saha commented on TEZ-1499:
---------------------------------

Does this need the fix for local mode made in TEZ-1587?
{code}+    OrderedPartitionedKVEdgeConfig edgeConf =
+        OrderedPartitionedKVEdgeConfig.newBuilder(Text.class.getName(),
+            NullWritable.class.getName(), HashPartitioner.class.getName())
+            .build();{code}

typo - should this say "the other vertex" for inputVertex2?
{code}+    /**
+     * This vertex represents the one side of the join. It reads text data using
{code}

Change names to HashJoinProcessor and SortMergeJoinProcessor?
Also, can we add a disclaimer to both join processors saying that the join 
code has been written as a tutorial for the APIs and not for performance?

If I am reading this correctly, there is a difference between the hashjoin and 
sortmergejoin processors. The hash join reads A into a map and outputs values 
from B where the B-value exists in the hash map. So if there are multiple 
occurrences of the same B-value then all of them will be output. The 
sort-merge-join processor seems to be matching the first common occurrence of 
A-value and B-value but not other occurrences of B-value. Is that a correct 
observation? If so, we should check if JoinDataGen and JoinValidate can break 
due to that difference. It's fine for both join processors to behave 
differently as long as it's documented, though ideally they should behave the 
same.
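
To make the suspected difference concrete, here is a minimal sketch in plain 
Java collections. It does not use the Tez APIs and is not the code from the 
patch; the class and method names are hypothetical, and the sort-merge variant 
only models the behavior as I read it above (emit on the first match, then 
advance both inputs).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not the processors from the patch.
class JoinSemanticsSketch {

  // Hash-join style: load side A into a set, stream side B,
  // and emit every B key that is present in A (duplicates in B included).
  static List<String> hashJoin(List<String> a, List<String> b) {
    Set<String> lookup = new HashSet<>(a);
    List<String> out = new ArrayList<>();
    for (String key : b) {
      if (lookup.contains(key)) {
        out.add(key);
      }
    }
    return out;
  }

  // Sort-merge style as described in the comment: both inputs are sorted,
  // and on a match the key is emitted once and BOTH cursors advance,
  // so extra occurrences of the same key in B are skipped.
  static List<String> sortMergeJoinFirstMatch(List<String> sortedA, List<String> sortedB) {
    List<String> out = new ArrayList<>();
    int i = 0;
    int j = 0;
    while (i < sortedA.size() && j < sortedB.size()) {
      int cmp = sortedA.get(i).compareTo(sortedB.get(j));
      if (cmp == 0) {
        out.add(sortedA.get(i));
        i++;
        j++;
      } else if (cmp < 0) {
        i++;
      } else {
        j++;
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> a = Arrays.asList("apple", "banana");
    List<String> b = Arrays.asList("apple", "apple", "cherry");
    System.out.println(hashJoin(a, b));                // [apple, apple]
    System.out.println(sortMergeJoinFirstMatch(a, b)); // [apple]
  }
}
```

With a duplicated key on the B side the two variants return different row 
counts, which is the case worth checking against JoinDataGen and JoinValidate.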

Would it make sense for both the pipeline test cases to use session mode to 
make the tests run a bit faster?

> Add SortMergeJoinExample to tez-examples
> ----------------------------------------
>
>                 Key: TEZ-1499
>                 URL: https://issues.apache.org/jira/browse/TEZ-1499
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>         Attachments: Tez-1499-2.patch, Tez-1499.patch
>
>
> In the current join example, the inputs of JoinProcessor are unordered, so 
> it always needs to load one input into memory and stream the other input. 
> This only fits the case where one dataset is small enough to fit into 
> memory (even without broadcast, memory may not be enough). So I'd like to 
> add another join example that makes the inputs of JoinProcessor ordered 
> (using OrderedPartitionedKVEdgeConfig). This kind of join can be used 
> when both datasets are large.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
