[ https://issues.apache.org/jira/browse/PIG-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Olga Natkovich updated PIG-1792:
--------------------------------
Fix Version/s: (was: 0.10)
> Skewed Join Taking Too Long and Producing Too Much Data
> -------------------------------------------------------
>
> Key: PIG-1792
> URL: https://issues.apache.org/jira/browse/PIG-1792
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.8.0
> Reporter: Ranjit Mathew
> Assignee: Thejas M Nair
>
> With Pig 0.8.0 and Hadoop 0.20, a skewed join takes too long and produces too
> much data.
> Using the data-generator from PIG-200, I generated two relations:
> --------------------------------- 8< ---------------------------------
> 3881312410 page_views
> 4370223 queryterm
> --------------------------------- 8< ---------------------------------
> (The first column represents the size in bytes of the relation in HDFS. So
> "page_views" was around 3,700 MiB and "queryterm" was around 4 MiB.)
> "queryterm" was generated from "page_views" using this Pig snippet:
> --------------------------------- 8< ---------------------------------
> pig << @EOF
> A = load 'page_views' using
> org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
> as (user, action, timespent, query_term, ip_addr, timestamp,
> estimated_revenue, page_info,
> page_links);
> B = foreach A generate query_term;
> C = sample B 0.2;
> store C into 'queryterm';
> @EOF
> --------------------------------- 8< ---------------------------------
> To test skewed join, I used the following script:
> --------------------------------- 8< ---------------------------------
> A = load 'page_views' using
> org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
> as (user, action, timespent, query_term, ip_addr, timestamp,
> estimated_revenue, page_info,
> page_links);
> B = load 'queryterm' as (query_term);
> C = join A by query_term, B by query_term using 'skewed' parallel 40;
> store C into 'L18out';
> --------------------------------- 8< ---------------------------------
> I had to abort this script after it had run for about 18.5 hours and had
> generated about 7 TiB of data. :-(