[
https://issues.apache.org/jira/browse/PIG-211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12590708#action_12590708
]
Pi Song commented on PIG-211:
-----------------------------
These might be useful for you:
1) What really happens in our Pig MapReduce execution engine is that all the
records on both sides are separated into a number of buckets based on the sort
key, and then a local sort is done as part of Reduce (we can do it this way
because at the moment we only support equi-join). Statistically, the amount of
data in each bucket should not be too big, but data skew can still occur. One
way to help when some buckets are still too big is to use a second bucketing
function to slice them into smaller buckets (see the partitioner sketch
below). A parameterized partitioner could be used as well, but I don't think
Hadoop currently supports it :(
2) One way we could easily do what you've suggested is to use a UDF that reads
from the small table file. The small table file can be shipped to all the
processing nodes using a mechanism similar to what we've got in Pig Streaming
(see Pig Streaming SHIP in the Pig Wiki); a rough sketch of such a UDF is also
below. I'm really starting to think that the SHIP construct should not be
limited to Streaming.
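Regarding (1), here is a minimal sketch of the kind of second bucketing
function I mean, written as a Hadoop partitioner. This is illustrative only,
not Pig code: the property names, the static hot-key list, and the Text
key/value types are all assumptions, and for an equi-join to stay correct the
small side's records for a hot key would also have to be duplicated into each
of the extra slices.
{code:java}
// Sketch only: spread known-hot join keys over several reducers by mixing
// in a secondary hash, approximating a "second bucketing function".
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class SkewAwarePartitioner implements Partitioner<Text, Text> {

  private Set<String> hotKeys = new HashSet<String>();
  private int fanout = 1;

  public void configure(JobConf job) {
    // Hypothetical properties; a real implementation would discover skewed
    // keys (e.g. by sampling) rather than take a static list.
    hotKeys.addAll(Arrays.asList(job.get("skew.hot.keys", "").split(",")));
    fanout = job.getInt("skew.fanout", 4);
  }

  public int getPartition(Text key, Text value, int numPartitions) {
    int base = (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    if (!hotKeys.contains(key.toString())) {
      return base;                      // normal keys: plain hash partition
    }
    // Second bucketing function: spread a hot key's records across
    // 'fanout' adjacent partitions using a hash of the value.
    int slice = (value.hashCode() & Integer.MAX_VALUE) % fanout;
    return (base + slice) % numPartitions;
  }
}
{code}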
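Regarding (2), a rough sketch of the lookup UDF idea. The file name, the
tab-delimited layout, the assumption that column 0 is the join key, and the
exact EvalFunc signature are all made up for illustration; the small table
file would have to be placed in the task's working directory by SHIP or an
equivalent mechanism.
{code:java}
// Sketch only: load the shipped small table into memory once per task and
// do map-side lookups, so no reduce-side join is needed for this relation.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class SmallTableLookup extends EvalFunc<Tuple> {

  // In-memory copy of the small table, keyed by the join key.
  private Map<String, Tuple> smallTable = null;

  private void loadTable() throws IOException {
    smallTable = new HashMap<String, Tuple>();
    BufferedReader in = new BufferedReader(new FileReader("small_table.txt"));
    String line;
    while ((line = in.readLine()) != null) {
      String[] fields = line.split("\t");
      Tuple t = TupleFactory.getInstance().newTuple();
      for (String f : fields) {
        t.append(f);
      }
      smallTable.put(fields[0], t);   // assume column 0 is the join key
    }
    in.close();
  }

  @Override
  public Tuple exec(Tuple input) throws IOException {
    if (smallTable == null) {
      loadTable();                    // one-time load per task
    }
    String key = (String) input.get(0);
    return smallTable.get(key);       // null when there is no matching row
  }
}
{code}
The big table could then pick up the small side's columns with something like
joined = FOREACH big GENERATE *, SmallTableLookup(key); (again, illustrative
syntax only), entirely on the map side.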
This is part of optimization work that hasn't started yet, though it's good
that we've started a discussion. What do you think? Please keep giving us
your ideas!!
> Replicating small tables for joins
> ----------------------------------
>
> Key: PIG-211
> URL: https://issues.apache.org/jira/browse/PIG-211
> Project: Pig
> Issue Type: New Feature
> Components: data
> Reporter: John DeTreville
> Priority: Minor
>
> Joining a table A with a small table B can be disproportionately expensive if
> A must be sorted before the join, and the result must be sorted again. This
> effort can often be reduced or eliminated if table B is replicated in whole
> to all nodes.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.