[ https://issues.apache.org/jira/browse/PIG-1743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Viraj Bhat updated PIG-1743:
----------------------------
Description:
I have a data set for which the skewed join generates unevenly sized
partitions. The script looks like this:
{code}
Data1 = LOAD '/user/viraj/relation1.in' AS (ref,intVal);
Data2 = LOAD '/user/viraj/relation2.in' using PigStorage('\u0001') AS (ID:chararray, Key:chararray, DomainKey:chararray);
JoinData = JOIN Data1 BY ref LEFT OUTER, Data2 BY ID using 'skewed' PARALLEL 10;
STORE JoinData into 'skewedoutput' using PigStorage('\u0001');
{code}
The generated output consists of the following part files, which vary widely in size:
{quote}
$ hadoop fs -ls /user/viraj/skewedoutput
Found 10 items
-rw------- 3 viraj users 2090 2010-11-23 03:44 /user/viraj/skewedoutput/part-r-00000
-rw------- 3 viraj users 19380 2010-11-23 03:44 /user/viraj/skewedoutput/part-r-00001
-rw------- 3 viraj users 2090 2010-11-23 03:44 /user/viraj/skewedoutput/part-r-00002
-rw------- 3 viraj users 9690 2010-11-23 03:44 /user/viraj/skewedoutput/part-r-00003
-rw------- 3 viraj users 2090 2010-11-23 03:44 /user/viraj/skewedoutput/part-r-00004
-rw------- 3 viraj users 2090 2010-11-23 03:44 /user/viraj/skewedoutput/part-r-00005
-rw------- 3 viraj users 0 2010-11-23 03:44 /user/viraj/skewedoutput/part-r-00006
-rw------- 3 viraj users 0 2010-11-23 03:44 /user/viraj/skewedoutput/part-r-00007
-rw------- 3 viraj users 0 2010-11-23 03:44 /user/viraj/skewedoutput/part-r-00008
-rw------- 3 viraj users 0 2010-11-23 03:44 /user/viraj/skewedoutput/part-r-00009
{quote}
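To put a number on the imbalance, the sizes in the listing above can be summarized with a short script. This is only a rough sketch: the sizes are copied by hand from the listing, and `summarize` is a hypothetical helper written for illustration, not part of Pig or Hadoop.

```python
from statistics import mean

# Part-file sizes in bytes, copied from the listing above
# (part-r-00000 through part-r-00009, in order).
PART_SIZES = [2090, 19380, 2090, 9690, 2090, 2090, 0, 0, 0, 0]


def summarize(sizes):
    """Return simple skew statistics for a list of reducer output sizes."""
    total = sum(sizes)
    avg = mean(sizes)
    largest = max(sizes)
    return {
        "total_bytes": total,
        # Reducers that produced no output at all.
        "empty_parts": sum(1 for s in sizes if s == 0),
        # Fraction of the whole output held by the largest part file.
        "largest_share": largest / total if total else 0.0,
        # How many times larger the biggest part is than the average part.
        "max_over_mean": largest / avg if avg else 0.0,
    }


if __name__ == "__main__":
    for name, value in summarize(PART_SIZES).items():
        print(f"{name}: {value}")
```

For these sizes the ten parts total 37430 bytes, four of the ten parts are empty, and part-r-00001 alone holds roughly half of the output.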
Attaching input datasets.
Viraj
was:
I have a data set for which the skewed join generates unevenly sized
partitions. The script looks like this:
{script}
Data1 = LOAD '/user/viraj/relation1.in' AS (ref,intVal);
Data2 = LOAD '/user/viraj/relation2.in' using PigStorage('\u0001') AS (ID:chararray, Key:chararray, DomainKey:chararray);
JoinData = JOIN Data1 BY ref LEFT OUTER, Data2 BY ID using 'skewed' PARALLEL 10;
STORE JoinData into 'skewedoutput' using PigStorage('\u0001');
{script}
The output generated has the following part files of varying sizes
{quote}
$ hadoop fs -ls /user/viraj/skewedoutput
Found 10 items
-rw------- 3 viraj users 2090 2010-11-23 03:44 /user/viraj/skewedoutput/part-r-00000
-rw------- 3 viraj users 19380 2010-11-23 03:44 /user/viraj/skewedoutput/part-r-00001
-rw------- 3 viraj users 2090 2010-11-23 03:44 /user/viraj/skewedoutput/part-r-00002
-rw------- 3 viraj users 9690 2010-11-23 03:44 /user/viraj/skewedoutput/part-r-00003
-rw------- 3 viraj users 2090 2010-11-23 03:44 /user/viraj/skewedoutput/part-r-00004
-rw------- 3 viraj users 2090 2010-11-23 03:44 /user/viraj/skewedoutput/part-r-00005
-rw------- 3 viraj users 0 2010-11-23 03:44 /user/viraj/skewedoutput/part-r-00006
-rw------- 3 viraj users 0 2010-11-23 03:44 /user/viraj/skewedoutput/part-r-00007
-rw------- 3 viraj users 0 2010-11-23 03:44 /user/viraj/skewedoutput/part-r-00008
-rw------- 3 viraj users 0 2010-11-23 03:44 /user/viraj/skewedoutput/part-r-00009
{quote}
Attaching input datasets.
Viraj
> Skewed join sampler generates unevenly partitioned data
> -------------------------------------------------------
>
> Key: PIG-1743
> URL: https://issues.apache.org/jira/browse/PIG-1743
> Project: Pig
> Issue Type: Bug
> Components: impl
> Affects Versions: 0.7.0, 0.8.0
> Reporter: Viraj Bhat
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.