[ 
https://issues.apache.org/jira/browse/PIG-1743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935057#action_12935057
 ] 

Richard Ding commented on PIG-1743:
-----------------------------------

The data sets given here are too small for Pig to split keys into multiple 
reducers. Pig is smart enough to decide that there is no need for splitting the 
keys.
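
To illustrate why small inputs are not split, here is a minimal sketch of the kind of decision involved. It is a simplified model, not Pig's actual sampler code: the function name, its parameters, and the heap size used in the example are all hypothetical. `mem_fraction` loosely stands in for the `pig.skewedjoin.reduce.memusage` setting (documented default 0.5). The assumption is that a key is spread over several reducers only when its estimated tuples would not fit in the memory budget of one reducer.

```python
import math


def reducers_for_key(estimated_tuples, avg_tuple_bytes,
                     reducer_heap_bytes, mem_fraction=0.5):
    """Simplified model: how many reducers one join key would be spread over.

    All names here are illustrative assumptions, not Pig internals.
    """
    # Number of tuples assumed to fit in the reducer's memory budget.
    tuples_that_fit = max(1, int(reducer_heap_bytes * mem_fraction / avg_tuple_bytes))
    # A key needs more than one reducer only if it exceeds that budget.
    return max(1, math.ceil(estimated_tuples / tuples_that_fit))


# A key with a few hundred small tuples fits in one reducer: no split.
small = reducers_for_key(200, 100, 1024 ** 3)
# A key with tens of millions of tuples would be spread over several reducers.
large = reducers_for_key(20_000_000, 100, 1024 ** 3)
print(small, large)
```

Under this model, inputs of a few kilobytes such as the ones attached never exceed the budget, so every key stays on a single reducer and the part file sizes simply follow the key distribution.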

> Skewed join sampler generates unevenly partitioned data
> -------------------------------------------------------
>
>                 Key: PIG-1743
>                 URL: https://issues.apache.org/jira/browse/PIG-1743
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.7.0, 0.8.0
>            Reporter: Viraj Bhat
>         Attachments: relation1.in, relation2.in
>
>
> I have a data set for which the skewed join generates uneven partitions. The 
> script looks like this:
> {code}
> Data1 = LOAD '/user/viraj/relation1.in' AS (ref,intVal);
> Data2 = LOAD '/user/viraj/relation2.in' using PigStorage('\u0001')
>     AS (ID:chararray, Key:chararray, DomainKey:chararray);
> JoinData = JOIN Data1 BY ref LEFT OUTER, Data2 BY ID
>     using 'skewed' PARALLEL 10;
> STORE JoinData into 'skewedoutput' using PigStorage('\u0001');
> {code}
> The generated output consists of the following part files of varying sizes:
> {quote}
> $ hadoop fs -ls /user/viraj/skewedoutput
> Found 10 items
> -rw-------   3 viraj users       2090 2010-11-23 03:44 /user/viraj/skewedoutput/part-r-00000
> -rw-------   3 viraj users      19380 2010-11-23 03:44 /user/viraj/skewedoutput/part-r-00001
> -rw-------   3 viraj users       2090 2010-11-23 03:44 /user/viraj/skewedoutput/part-r-00002
> -rw-------   3 viraj users       9690 2010-11-23 03:44 /user/viraj/skewedoutput/part-r-00003
> -rw-------   3 viraj users       2090 2010-11-23 03:44 /user/viraj/skewedoutput/part-r-00004
> -rw-------   3 viraj users       2090 2010-11-23 03:44 /user/viraj/skewedoutput/part-r-00005
> -rw-------   3 viraj users          0 2010-11-23 03:44 /user/viraj/skewedoutput/part-r-00006
> -rw-------   3 viraj users          0 2010-11-23 03:44 /user/viraj/skewedoutput/part-r-00007
> -rw-------   3 viraj users          0 2010-11-23 03:44 /user/viraj/skewedoutput/part-r-00008
> -rw-------   3 viraj users          0 2010-11-23 03:44 /user/viraj/skewedoutput/part-r-00009
> {quote}
> Attaching input datasets.
> Viraj

