[jira] Commented: (PIG-554) Fragment Replicate Join

Olga Natkovich (JIRA) Wed, 03 Dec 2008 14:34:37 -0800

    [ 
https://issues.apache.org/jira/browse/PIG-554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653009#action_12653009
 ]


Olga Natkovich commented on PIG-554:
------------------------------------

I ran tests and they all passed.

Here are some comments on the patch:

(1) New files should include the Apache license header.
(2) LOFRJoin.getSchema(): I don't think the nonDuplicates computation would 
work for more than two tables with the same column.
(3) LOFRJoin.getTupleJoinColSchema(): has a comment saying: "This doesn't work 
with join by complex type". Does this mean that FRJ does not work with columns 
of type Tuple? According to Alan, tuple columns are supported in the case of 
regular join. I think it is ok if the initial patch does not support it, but 
we should probably have a separate JIRA to track this issue.
(4) In the grammar, you made "replicated" a token. I thought we would make 
it a string so as not to bloat the keyword space.
(5) I see that the implementation seems to allow more than 2 tables but the 
test cases only cover 2 tables. I am fine if we initially only support 2 
tables - I just wanted to clarify the intent here.
(6) Also, I ran explain on the following query and the result seems to have a 
separate map step that I was not sure about:

A = load '/user/pig/tests/data/singlefile/student_data' as (name, age, gpa);
B = load '/user/pig/tests/data/singlefile/student_data' as (name, age, gpa);
C = JOIN A by (name, age), B by (name, age) USING replicated;
explain C;

--------------------------------------------------
| Map Reduce Plan                                |
--------------------------------------------------
MapReduce node olgan-Wed Dec 03 14:21:35 PST 2008-57
Map Plan
Store(/tmp/temp921697735/tmp-320517577:org.apache.pig.builtin.BinStorage) - olgan-Wed Dec 03 14:21:35 PST 2008-58
|
|---Load(/user/pig/tests/data/singlefile/studenttab10k:org.apache.pig.builtin.PigStorage) - olgan-Wed Dec 03 14:21:35 PST 2008-44--------
Global sort: false
----------------
MapReduce node olgan-Wed Dec 03 14:21:35 PST 2008-56
Map Plan
Store(fakefile:org.apache.pig.builtin.PigStorage) - olgan-Wed Dec 03 14:21:35 PST 2008-55
|
|---FRJoin[tuple] - olgan-Wed Dec 03 14:21:35 PST 2008-49
    |   |
    |   Project[bytearray][0] - olgan-Wed Dec 03 14:21:35 PST 2008-45
    |   |
    |   Project[bytearray][1] - olgan-Wed Dec 03 14:21:35 PST 2008-46
    |   |
    |   Project[bytearray][0] - olgan-Wed Dec 03 14:21:35 PST 2008-47
    |   |
    |   Project[bytearray][1] - olgan-Wed Dec 03 14:21:35 PST 2008-48
    |
    |---Load(/user/pig/tests/data/singlefile/studenttab10k:org.apache.pig.builtin.PigStorage) - olgan-Wed Dec 03 14:21:35 PST 2008-43--------
Global sort: false

> Fragment Replicate Join
> -----------------------
>
>                 Key: PIG-554
>                 URL: https://issues.apache.org/jira/browse/PIG-554
>             Project: Pig
>          Issue Type: New Feature
>    Affects Versions: types_branch
>            Reporter: Shravan Matthur Narayanamurthy
>            Assignee: Shravan Matthur Narayanamurthy
>             Fix For: types_branch
>
>         Attachments: frjofflat.patch
>
>
> Fragment Replicate Join (FRJ) is useful when we want a join between a huge 
> table and a very small table (small enough to fit in memory) and the join 
> doesn't expand the data by much. The idea is to distribute the processing of 
> the huge file by fragmenting it and replicating the small file to all 
> machines receiving a fragment of the huge file. Because the entire small file 
> is available on each machine, the join becomes a trivial task without needing 
> any break in the pipeline. Exhaustive tests have been done to determine the 
> improvement we get out of FRJ. Here are the details: 
> http://wiki.apache.org/pig/PigFRJoin
> The patch makes changes to parts of the code where new operators are 
> introduced. Currently, when a new operator is introduced, its alias is not 
> set. For schema computation I have modified this behaviour to set the alias 
> of the new operator to that of its predecessor. The logical side of the patch 
> mimics the cogroup behavior as join syntax closely resembles that of cogroup. 
> Currently, this patch doesn't have support for joins other than inner joins. 
> The rest of the code has been documented.
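
For readers unfamiliar with the technique, the idea in the description above 
can be sketched in a few lines of Python. This is only an illustration of a 
fragment replicate (map-side hash) join in general, not the code in 
frjofflat.patch. The function name fragment_replicate_join, the tuple layout, 
and the sample data are all made up for the example.

```python
from collections import defaultdict


def fragment_replicate_join(fragment, small_table, big_key, small_key):
    """Inner-join one fragment of the big input against the small input.

    Hypothetical helper for illustration only; it is not part of Pig.
    """
    # Build an in-memory hash table over the small (replicated) input.
    index = defaultdict(list)
    for row in small_table:
        index[small_key(row)].append(row)

    # Stream the fragment; each row is joined locally, with no shuffle
    # or reduce step in between.
    for row in fragment:
        for match in index.get(big_key(row), ()):
            yield row + match


# Made-up sample data: (name, age, gpa) rows and a small (name, age, club) table.
students = [
    ("alice", 20, 3.5),
    ("bob", 21, 2.9),
    ("carol", 20, 3.8),
    ("alice", 22, 3.1),
]
clubs = [("alice", 20, "chess"), ("carol", 20, "go"), ("carol", 20, "math")]

# Two fragments of the big input, as two separate map tasks would see them.
fragments = [students[:2], students[2:]]


def name_age(row):
    # Join key: the first two fields, (name, age).
    return (row[0], row[1])


joined = [
    out
    for frag in fragments
    for out in fragment_replicate_join(frag, clubs, name_age, name_age)
]
```

Each fragment is processed independently against its own copy of the small 
input, which is why the join needs no break in the pipeline. The sketch also 
shows the constraint from the description: the small input has to fit in 
memory wherever a fragment is processed.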

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
