[jira] Updated: (PIG-554) Fragment Replicate Join

Pradeep Kamath (JIRA) Tue, 06 Jan 2009 18:04:15 -0800

     [ 
https://issues.apache.org/jira/browse/PIG-554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Pradeep Kamath updated PIG-554:
-------------------------------

    Attachment: PIG-554-v3.patch

Changes in new patch submitted (PIG-554-v3.patch):
1) The code did not handle the case where the join key is "*", as illustrated 
in the script below:
{code}
a = load ... as (a:chararray, b:chararray);
b = load ... as (a:chararray, b);
c = join a by *, b by * using "replicated";
dump c;
{code}
In the above script the join key is a tuple, and its second column in the 
second input must be cast so that the key types of both inputs match. For 
this, ProjectStarTranslator needs an implementation of visit(LOFRJoin) so 
that Project(*) is translated into multiple Project operations. After this 
translation, the type checking code correctly identifies the join key as a 
tuple and inserts the necessary cast.
2) In POFRJoin, a HashMap is used instead of a Hashtable to avoid the 
performance cost of the synchronization in Hashtable (HashMap is not 
synchronized). The HashMap entries are now (tuple, DataBag) instead of the 
earlier (tuple, List<Tuple>), which avoids constructing bags out of the 
lists in getNext().
3) Changed a couple of System.out.println() statements to log.debug().
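
For illustration, the lookup structure described in 2) can be sketched roughly 
as below. This is a minimal sketch and not the code in the patch: FrjSketch, 
buildReplicated, join and key are hypothetical names, List<String> stands in 
for Pig's Tuple, and a List of rows stands in for DataBag. Null-key handling 
is omitted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the POFRJoin lookup; these are not Pig's classes.
class FrjSketch {

    // Build phase: index the small (replicated) input by its join key.
    // Each map value is the ready-made group of matching rows (the role the
    // DataBag plays), so nothing has to be rebuilt on every probe.
    static Map<List<String>, List<List<String>>> buildReplicated(
            List<List<String>> small, int[] keyCols) {
        // HashMap is unsynchronized; assumed safe here because the map is
        // built and read by a single thread.
        Map<List<String>, List<List<String>>> index = new HashMap<>();
        for (List<String> row : small) {
            index.computeIfAbsent(key(row, keyCols), k -> new ArrayList<>()).add(row);
        }
        return index;
    }

    // Probe phase: stream one fragment of the large input past the index.
    static List<List<String>> join(List<List<String>> fragment, int[] keyCols,
                                   Map<List<String>, List<List<String>>> index) {
        List<List<String>> out = new ArrayList<>();
        for (List<String> left : fragment) {
            List<List<String>> matches = index.get(key(left, keyCols));
            if (matches == null) {
                continue; // inner join: unmatched rows are dropped
            }
            for (List<String> right : matches) {
                List<String> joined = new ArrayList<>(left);
                joined.addAll(right);
                out.add(joined);
            }
        }
        return out;
    }

    // The join key is itself a tuple made of the selected columns.
    static List<String> key(List<String> row, int[] keyCols) {
        List<String> k = new ArrayList<>(keyCols.length);
        for (int c : keyCols) {
            k.add(row.get(c));
        }
        return k;
    }

    public static void main(String[] args) {
        List<List<String>> small = Arrays.asList(
                Arrays.asList("k1", "x"), Arrays.asList("k1", "y"));
        List<List<String>> fragment = Arrays.asList(
                Arrays.asList("k1", "L"), Arrays.asList("k3", "M"));
        int[] keyCols = {0};
        System.out.println(join(fragment, keyCols, buildReplicated(small, keyCols)));
    }
}
```

Keeping the grouped rows as the map value means the probe only has to look up 
the key and iterate over the group, which is the motivation given in 2) for 
storing bags rather than lists.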

> Fragment Replicate Join
> -----------------------
>
>                 Key: PIG-554
>                 URL: https://issues.apache.org/jira/browse/PIG-554
>             Project: Pig
>          Issue Type: New Feature
>    Affects Versions: types_branch
>            Reporter: Shravan Matthur Narayanamurthy
>            Assignee: Shravan Matthur Narayanamurthy
>             Fix For: types_branch
>
>         Attachments: frjofflat.patch, frjofflat1.patch, PIG-554-v3.patch
>
>
> Fragment Replicate Join (FRJ) is useful when we want a join between a huge 
> table and a very small table (small enough to fit in memory) and the join 
> doesn't expand the data by much. The idea is to distribute the processing of 
> the huge file by fragmenting it and replicating the small file to all 
> machines receiving a fragment of the huge file. Because the entire small 
> file is available, the join becomes a trivial task without needing any break 
> in the pipeline. Exhaustive tests have been done to determine the improvement 
> we get out of FRJ. Here are the details: http://wiki.apache.org/pig/PigFRJoin
> The patch makes changes to parts of the code where new operators are 
> introduced. Currently, when a new operator is introduced, its alias is not 
> set. For schema computation I have modified this behaviour to set the alias 
> of the new operator to that of its predecessor. The logical side of the patch 
> mimics the cogroup behavior as join syntax closely resembles that of cogroup. 
> Currently, this patch doesn't have support for joins other than inner joins. 
> The rest of the code has been documented.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
