Skewed Join Taking Too Long and Producing Too Much Data
-------------------------------------------------------
Key: PIG-1792
URL: https://issues.apache.org/jira/browse/PIG-1792
Project: Pig
Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Ranjit Mathew
With Pig 0.8.0 and Hadoop 0.20, a skewed join takes too long and produces too
much
data.
Using the data-generator from PIG-200, I generated two relations:
--------------------------------- 8< ---------------------------------
3881312410 page_views
4370223 queryterm
--------------------------------- 8< ---------------------------------
(The first column represents the size in bytes of the relation in HDFS. So
"page_views"
was around 4,700 MiB and "queryterm" was around 4 MiB.)
"queryterm" was generated from "page_views" using this Pig snippet:
--------------------------------- 8< ---------------------------------
pig << @EOF
A = load 'page_views' using
org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
as (user, action, timespent, query_term, ip_addr, timestamp,
estimated_revenue, page_info,
page_links);
B = foreach A generate query_term;
C = sample B 0.2;
store C into 'queryterm';
@EOF
--------------------------------- 8< ---------------------------------
To test skewed join, I used the following script:
--------------------------------- 8< ---------------------------------
A = load 'page_views' using
org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
as (user, action, timespent, query_term, ip_addr, timestamp,
estimated_revenue, page_info,
page_links);
B = load 'queryterm' as (query_term);
C = join A by query_term, B by query_term using 'skewed' parallel 40;
store C into 'L18out';
--------------------------------- 8< ---------------------------------
I had to abort this script after it had run for about 18.5 hours and had
generated
about 7 TiB of data. :-(
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.