Alan Gates commented on PIG-554:

A couple of questions:

1) I'm still not clear on why the additional maps are needed to load the 
replicated inputs into files.  Those inputs are already in files.  Are you 
somehow transforming them?  Isn't this exactly where we should be using the 
DistributedCache?  Rather than having map jobs that transform them, I think 
the best thing would be to have the MRCompiler set a flag telling the 
JobControlCompiler to load those files into the DistributedCache for this job.

2) You are using POLocalRearrange both in setting up the hash table and in 
reading the fragmented table before the join.  What benefit is being derived 
from this?  LR adds a lot of extra weight to the tuple that I don't think is 
needed.  I suspect we could fit more tuples into memory if we loaded them 
directly rather than using LR.
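
As a rough illustration of the direct-loading idea in (2), here is a minimal 
sketch of a fragment replicate join over plain Object[] rows.  It is not Pig 
code: the class FrjSketch, the methods buildTable and join, and the row 
representation are all hypothetical names chosen for this example, and it 
assumes the replicated input has already been read into memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Pig's implementation: rows are plain Object[]
// arrays stored as-is, with no extra per-tuple wrapping.
public class FrjSketch {

    // Build the in-memory hash table from the small (replicated) input,
    // keyed on the join column. Rows with a null key are skipped because
    // they can never match in an inner join.
    static Map<Object, List<Object[]>> buildTable(List<Object[]> replicated, int keyCol) {
        Map<Object, List<Object[]>> table = new HashMap<>();
        for (Object[] row : replicated) {
            if (row[keyCol] == null) {
                continue;
            }
            table.computeIfAbsent(row[keyCol], k -> new ArrayList<>()).add(row);
        }
        return table;
    }

    // Stream one fragment of the large input past the hash table and emit
    // one concatenated row per match (inner join only).
    static List<Object[]> join(List<Object[]> fragment, int keyCol,
                               Map<Object, List<Object[]>> table) {
        List<Object[]> out = new ArrayList<>();
        for (Object[] row : fragment) {
            List<Object[]> matches = table.get(row[keyCol]);
            if (matches == null) {
                continue; // no partner in the replicated input
            }
            for (Object[] m : matches) {
                Object[] joined = Arrays.copyOf(row, row.length + m.length);
                System.arraycopy(m, 0, joined, row.length, m.length);
                out.add(joined);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Object[]> small = new ArrayList<>();
        small.add(new Object[] {1, "a"});
        small.add(new Object[] {2, "b"});

        List<Object[]> fragment = new ArrayList<>();
        fragment.add(new Object[] {1, "x"});
        fragment.add(new Object[] {3, "y"});
        fragment.add(new Object[] {1, "z"});

        Map<Object, List<Object[]>> table = buildTable(small, 0);
        for (Object[] row : join(fragment, 0, table)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

In a map task, buildTable would run once during setup and join would be 
applied to the records of that task's fragment.  How much memory this saves 
compared with going through LR would need to be measured.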

> Fragment Replicate Join
> -----------------------
>                 Key: PIG-554
>                 URL: https://issues.apache.org/jira/browse/PIG-554
>             Project: Pig
>          Issue Type: New Feature
>    Affects Versions: types_branch
>            Reporter: Shravan Matthur Narayanamurthy
>            Assignee: Shravan Matthur Narayanamurthy
>             Fix For: types_branch
>         Attachments: frjofflat.patch, frjofflat1.patch
>
> Fragment Replicate Join (FRJ) is useful when we want a join between a huge 
> table and a very small table (small enough to fit in memory) and the join 
> doesn't expand the data by much. The idea is to distribute the processing of 
> the huge file by fragmenting it and replicating the small file to all 
> machines receiving a fragment of the huge file. Because the entire small file 
> is available on each machine, the join becomes a trivial task without needing 
> any break in the pipeline. Exhaustive tests have been done to determine the 
> improvement we get out of FRJ. Here are the details: 
> http://wiki.apache.org/pig/PigFRJoin
> The patch makes changes to parts of the code where new operators are 
> introduced. Currently, when a new operator is introduced, its alias is not 
> set. For schema computation I have modified this behaviour to set the alias 
> of the new operator to that of its predecessor. The logical side of the patch 
> mimics the cogroup behavior as join syntax closely resembles that of cogroup. 
> Currently, this patch doesn't have support for joins other than inner joins. 
> The rest of the code has been documented.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
