[ 
https://issues.apache.org/jira/browse/PIG-211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12590708#action_12590708
 ] 

Pi Song commented on PIG-211:
-----------------------------

These points might be useful to you:

1) What actually happens in the Pig MapReduce execution engine is that all the 
records from both sides are partitioned into buckets based on the sort key, 
and a local sort then runs within each bucket as part of Reduce. (We can do it 
this way because at the moment we only support equi-joins.) Statistically, the 
data in each bucket should not be too big, though data skew can still occur. 
If some buckets are still too big, one possible remedy is a second bucketing 
function that slices them into smaller buckets. A parameterized partitioner 
could be used as well, but I don't think Hadoop currently supports it :(

2) An easy way to do what you've suggested is to use a UDF that reads from the 
small table file. The small table file can be shipped to all the processing 
nodes using a mechanism similar to the one in Pig Streaming (see Pig Streaming 
SHIP in the Pig Wiki). I'm really starting to think that the SHIP construct 
should not be limited to Streaming.
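
To make point 2 concrete, here is a minimal sketch in plain Python (not Pig 
or Hadoop code) of what such a UDF would do on each node: load the shipped 
small table into memory once, then stream the big table past it. The names 
load_small_table and replicated_join, the tab-separated layout, and the sample 
rows are assumptions made up for illustration only.

```python
import csv
import io


def load_small_table(lines, key_index=0):
    """Build an in-memory index: join key -> list of small-table rows."""
    index = {}
    for row in csv.reader(lines, delimiter="\t"):
        index.setdefault(row[key_index], []).append(row)
    return index


def replicated_join(big_rows, small_index, key_index=0):
    """Stream the big table once; no sort or shuffle of the big side."""
    for row in big_rows:
        # Rows whose key is absent from the small table are dropped (inner join).
        for match in small_index.get(row[key_index], []):
            yield row + match


# Stand-in for the small table file shipped to every node (hypothetical data).
small_file = io.StringIO("1\tapple\n2\tbanana\n")
small_index = load_small_table(small_file)

big_rows = [["2", "x"], ["3", "y"], ["1", "z"], ["2", "w"]]
joined = list(replicated_join(big_rows, small_index))
```

The big table keeps its original order and is never sorted, which is the 
saving the issue asks for. The obvious limit is that the small table has to 
fit in memory on every node.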

This is part of optimization work that hasn't started yet, though it's good 
that we've started a discussion. What do you think? Please keep giving us 
your ideas!
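
Going back to point 1, the bucketing idea can be sketched roughly as follows. 
This is plain Python rather than the actual Pig or Hadoop code, and every name 
in it (bucket_of, partition, join_bucket, bucketed_join, the "second:" salt, 
the thresholds) is an assumption for illustration. Rows from both sides are 
tagged, hashed into buckets by key, sorted locally, and joined per bucket; a 
bucket above the threshold is re-split with a second, salted hash.

```python
import itertools
import zlib
from collections import defaultdict


def bucket_of(key, num_buckets, salt=""):
    # crc32 is deterministic across runs, unlike Python's built-in str hash.
    return zlib.crc32((salt + key).encode("utf-8")) % num_buckets


def partition(rows, num_buckets, salt=""):
    """Group (tag, key, value) rows by the bucket their key hashes to."""
    buckets = defaultdict(list)
    for tag, key, value in rows:
        buckets[bucket_of(key, num_buckets, salt)].append((tag, key, value))
    return buckets


def join_bucket(rows):
    """Local sort by key, then pair left and right rows sharing a key."""
    out = []
    rows = sorted(rows, key=lambda r: r[1])
    for key, group in itertools.groupby(rows, key=lambda r: r[1]):
        group = list(group)
        lefts = [v for t, _, v in group if t == "L"]
        rights = [v for t, _, v in group if t == "R"]
        out.extend((key, l, r) for l in lefts for r in rights)
    return out


def bucketed_join(left, right, num_buckets=4, max_bucket_rows=1000,
                  sub_buckets=4):
    tagged = [("L", k, v) for k, v in left] + [("R", k, v) for k, v in right]
    result = []
    for rows in partition(tagged, num_buckets).values():
        if len(rows) > max_bucket_rows:
            # Oversized bucket: slice it again with a second bucketing function.
            for sub in partition(rows, sub_buckets, salt="second:").values():
                result.extend(join_bucket(sub))
        else:
            result.extend(join_bucket(rows))
    return result


left = [("a", 1), ("b", 2), ("a", 3)]
right = [("a", "x"), ("c", "y"), ("b", "z")]
joined = sorted(bucketed_join(left, right))
joined_split = sorted(bucketed_join(left, right, max_bucket_rows=1))
```

Because equal keys always hash to the same bucket at both levels, this only 
works for equi-joins. Note also that a second hash helps when a big bucket 
holds many distinct keys; a single very hot key still ends up in one bucket.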

> Replicating small tables for joins
> ----------------------------------
>
>                 Key: PIG-211
>                 URL: https://issues.apache.org/jira/browse/PIG-211
>             Project: Pig
>          Issue Type: New Feature
>          Components: data
>            Reporter: John DeTreville
>            Priority: Minor
>
> Joining a table A with a small table B can be disproportionately expensive if 
> A must be sorted before the join, and the result must be sorted again. This 
> effort can often be reduced or eliminated if table B is replicated in whole 
> to all nodes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
