Okay, we did some filtering on all_urls in the nested FOREACH, and that seems
to fix the performance issue.  Some mappers still get 8GB of data, but the job
went down from 6 hours 42 minutes to 2 hours.

From Dmitry's reply, it sounds like the low memory handler output in the logs
is misleading.
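
One way a filter like this might look (a sketch only, not the exact change
made to the script; filtered_urls is a hypothetical name, and reusing the
top_ratios threshold for the other reports is an assumption that changes
which URLs they can return):

```pig
-- Assumption: the threshold top_ratios already uses is acceptable everywhere.
-- Filtering before the GROUP shrinks each site's bag before it is sorted.
filtered_urls = FILTER all_urls BY views > ($min_page_views_per_day*$timeperiod);
grouped_urls_by_site = GROUP filtered_urls BY site;

top_views = FOREACH grouped_urls_by_site
                {
                  order_by_views = ORDER filtered_urls BY views DESC;
                  top_by_views = LIMIT order_by_views $topN;
                  GENERATE group AS site, top_by_views.(url, title,
                      ratio, gviews, views, copies) AS tops;
                }
```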

On May 6, 2010, at 3:30 PM, Corbin Hoenes wrote:

> I'm wondering whether, when we do a group like this:
> 
> grouped_urls_by_site = GROUP all_urls BY site;
> 
> and a certain site has a lot of URLs, they would all have to be processed by
> the same mapper (i.e., a single key).  Could this account for why we have 8GB
> in one map and far less in the others?
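> 
> A GROUP collects every record that shares a key into one bag, handled by a
> single reduce task, so a site with far more URLs than the rest would show up
> as one oversized task.  A rough way to check for that kind of skew (a sketch;
> by_site, site_counts, ordered_counts and biggest_sites are hypothetical
> names, and 20 is an arbitrary cutoff):
> 
> ```pig
> -- Count URLs per site and list the largest groups.
> -- all_urls and site come from the script; every other name is made up.
> by_site = GROUP all_urls BY site;
> site_counts = FOREACH by_site GENERATE group AS site, COUNT(all_urls) AS num_urls;
> ordered_counts = ORDER site_counts BY num_urls DESC;
> biggest_sites = LIMIT ordered_counts 20;
> DUMP biggest_sites;
> ```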
> 
> On May 6, 2010, at 3:24 PM, Olga Natkovich wrote:
> 
>> Looks like attachments are not coming through. Here is the script from
>> Corbin inline.
>> 
>> One thing you might want to try is to switch your cogroups to skewed
>> join and see if that solves the issue:
>> 
>> http://hadoop.apache.org/pig/docs/r0.6.0/piglatin_ref1.html#Skewed+Joins
>> 
>> Olga
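>> 
>> For illustration, a skewed join over the first COGROUP's inputs might look
>> roughly like the following.  This is a sketch, not a drop-in replacement:
>> joined_stats is a hypothetical name, an inner join drops URLs that have no
>> click stats (the COGROUP ... OUTER keeps them), and I believe the quoting of
>> the keyword differs between Pig releases, so check the linked page.
>> 
>> ```pig
>> -- Inner skewed join on the same keys as the COGROUP (assumed substitute).
>> -- Rows of light_viewstatsbyurl with no matching click stats are lost here.
>> joined_stats = JOIN light_viewstatsbyurl BY (site, url),
>>                light_clickstatsbyurl BY (site, url) USING "skewed";
>> ```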
>> 
>> --------------------------------------------topurl.pig-------------------------------------------------------------
>> set job.name 'Generate topurl reports for $out_file1'
>> 
>> %default dir_prefix '../..'
>> %default storage 'BinStorage()'
>> %default tynt_udfs 'tynt-udfs.jar'
>> %default topN '20'
>> /* default to 30 days time period so that alltime report will get
>> 14*30=420 min page views*/
>> %default timeperiod '30'
>> %default min_page_views_per_day '14'
>> 
>> register $dir_prefix/udfs/target/$tynt_udfs
>> register $dir_prefix/udfs/lib/piggybank.jar
>> 
>> ---------------------summarize address bar stats-----------------------------------
>> addbar_stats = LOAD '$in_file1/addbarstats' USING $storage AS
>> (site:chararray, url:chararray, guid:chararray, cnt:long);
>> grouped_addbar_by_url = GROUP addbar_stats BY (site, url) PARALLEL 180;
>> addbar_stats_by_url = FOREACH grouped_addbar_by_url GENERATE
>> FLATTEN(group) AS (site, url), COUNT(addbar_stats) AS addbarcnt,
>> SUM(addbar_stats.cnt) AS addbarvisits; 
>> STORE addbar_stats_by_url INTO '$out_file1/addbarstatsbyurl' USING
>> $storage;
>> 
>> grouped_addbar_stats_by_site = GROUP addbar_stats_by_url BY site
>> PARALLEL 180;
>> addbar_stats_by_site = FOREACH grouped_addbar_stats_by_site GENERATE
>> group AS site, SUM(addbar_stats_by_url.addbarcnt) AS addbarcnt,
>> SUM(addbar_stats_by_url.addbarvisits) AS addbarvisits;
>> STORE addbar_stats_by_site INTO '$out_file1/addbarstatsbysite' USING
>> $storage;
>> 
>> ----------------------calculate ratio------------------------------------------
>> clickstatsbyurl = LOAD '$in_file1/clickstatsbyurl' USING $storage AS
>> (site:chararray, url:chararray, cnt:long, tracecnt:long, tcnt:long,
>> pcnt:long, wcnt:long, utracecnt:long, utcnt:long, upcnt:long,
>> uwcnt:long);
>> viewstatsbyurl = LOAD '$in_file1/viewstatsbyurl' USING $storage AS
>> (site:chararray, url:chararray, title:chararray, cnt:long, etcnt:long,
>> et1cnt:long, et2cnt:long, et3cnt:long, et6cnt:long, et7cnt:long);
>> 
>> light_clickstatsbyurl = FOREACH clickstatsbyurl GENERATE site, url, cnt;
>> light_viewstatsbyurl_noisy = FOREACH viewstatsbyurl GENERATE site, url,
>> title, cnt, etcnt;
>> 
>> light_viewstatsbyurl = FILTER light_viewstatsbyurl_noisy BY url != '-';
>> 
>> --light_addbarstatsbyurl = FOREACH addbar_stats_by_url GENERATE site, url, addbarvisits;
>> --joined_stats_for_ratio = COGROUP light_viewstatsbyurl BY (site, url) INNER,
>> --    light_clickstatsbyurl BY (site, url) OUTER,
>> --    light_addbarstatsbyurl BY (site, url) OUTER;
>> --flattened_stats_for_ratio = FOREACH joined_stats_for_ratio GENERATE
>> --    FLATTEN(light_viewstatsbyurl) AS (site, url, title, cnt, etcnt),
>> --    (IsEmpty(light_clickstatsbyurl)?0:MAX(light_clickstatsbyurl.cnt)) as clickcnt,
>> --    (IsEmpty(light_addbarstatsbyurl)?0:MAX(light_addbarstatsbyurl.addbarvisits)) as addbarcnt;
>> 
>> joined_stats_for_ratio = COGROUP light_viewstatsbyurl BY (site, url)
>> INNER, light_clickstatsbyurl BY (site, url) OUTER;
>> flattened_stats_for_ratio = FOREACH joined_stats_for_ratio GENERATE
>> FLATTEN(light_viewstatsbyurl) AS (site, url, title, cnt, etcnt),
>> 
>> (IsEmpty(light_clickstatsbyurl)?0:MAX(light_clickstatsbyurl.cnt)) as
>> clickcnt;
>> 
>> ratio_by_url = FOREACH flattened_stats_for_ratio 
>>                     {
>>                       generated_traffic = clickcnt+etcnt;
>>                       total_traffic = cnt;
>>                       ti =
>> ((float)(generated_traffic))/((float)total_traffic);
>>                       GENERATE site, url, title, ((ti>1)?(-ti):ti) AS
>> ratio, generated_traffic AS gviews, total_traffic AS views;
>>                     }
>> 
>> ------------------------combined with #copies----------------------------------------
>> copystatsbyurl = LOAD '$in_file1/copystatsbyurl' USING $storage AS
>> (site:chararray, url:chararray, lcnt:long, scnt:long, icnt:long,
>> acnt:long);
>> light_copystatsbyurl = FOREACH copystatsbyurl GENERATE site, url,
>> lcnt+scnt+icnt AS cnt;
>> 
>> all_stats_by_url = COGROUP ratio_by_url BY (site, url) INNER,
>> light_copystatsbyurl BY (site, url) OUTER PARALLEL 62;
>> all_urls = FOREACH all_stats_by_url GENERATE FLATTEN(ratio_by_url) AS
>> (site, url, title, ratio, gviews, views),
>> (IsEmpty(light_copystatsbyurl)?0:MAX(light_copystatsbyurl.cnt)) as
>> copies;
>> 
>> grouped_urls_by_site = GROUP all_urls BY site;
>> 
>> top_ratios = FOREACH grouped_urls_by_site 
>>                 {
>>                   filtered_by_minpageviews = FILTER all_urls BY views > ($min_page_views_per_day*$timeperiod);
>>                   order_by_ratio = ORDER filtered_by_minpageviews BY
>> ratio DESC;
>>                   top_by_ratio = LIMIT order_by_ratio $topN;
>>                   GENERATE group AS site, top_by_ratio.(url, title,
>> ratio, gviews, views, copies) AS tops;                   
>>                 }
>> 
>> top_gviews = FOREACH grouped_urls_by_site 
>>                 {
>>                   order_by_gviews = ORDER all_urls BY gviews DESC;
>>                   top_by_gviews = LIMIT order_by_gviews $topN;
>>                   GENERATE group AS site, top_by_gviews.(url, title,
>> ratio, gviews, views, copies) AS tops;                   
>>                 }
>> 
>> top_views = FOREACH grouped_urls_by_site 
>>                 {
>>                   order_by_views = ORDER all_urls BY views DESC;
>>                   top_by_views = LIMIT order_by_views $topN;
>>                   GENERATE group AS site, top_by_views.(url, title,
>> ratio, gviews, views, copies) AS tops;                   
>>                 }
>> 
>> top_copies = FOREACH grouped_urls_by_site 
>>                 {
>>                   order_by_copies = ORDER all_urls BY copies DESC;
>>                   top_by_copies = LIMIT order_by_copies $topN; 
>>                   GENERATE group AS site, top_by_copies.(url, title,
>> ratio, gviews, views, copies) AS tops;                   
>>                 }
>> 
>> grouped_tops = JOIN top_ratios BY site, top_gviews BY site, top_views BY
>> site, top_copies BY site;
>> 
>> top_urls = FOREACH grouped_tops GENERATE top_ratios::site AS site,
>> top_ratios::tops, top_gviews::tops, top_views::tops, top_copies::tops; 
>> 
>> store top_urls into '$out_file1/topurls' USING $storage;
>> 
>> 
>> 
>> -----Original Message-----
>> From: Corbin Hoenes [mailto:[email protected]] 
>> Sent: Thursday, May 06, 2010 11:57 AM
>> To: Olga Natkovich
>> Subject: Re: SpillableMemoryManager - low memory handler called
>> 
>> I have attached the script... please let me know if you have more
>> questions.  
>> 
>> 
>> On May 6, 2010, at 12:36 PM, Olga Natkovich wrote:
>> 
>>> This is just a warning saying that your job is spilling to disk.
>>> Please, if you can, post the script that is causing this issue. In 0.6.0
>>> we moved a large chunk of the code away from using SpillableMemoryManager,
>>> but it is still used in some places. More changes are coming in 0.7.0 as
>>> well.
>>> 
>>> Olga
>>> 
>>> -----Original Message-----
>>> From: Corbin Hoenes [mailto:[email protected]] 
>>> Sent: Thursday, May 06, 2010 11:31 AM
>>> To: [email protected]
>>> Subject: Re: SpillableMemoryManager - low memory handler called
>>> 
>>> 0.6
>>> 
>>> Sent from my iPhone
>>> 
>>> On May 6, 2010, at 12:16 PM, "Olga Natkovich" <[email protected]>  
>>> wrote:
>>> 
>>>> Which version of Pig are you using?
>>>> 
>>>> -----Original Message-----
>>>> From: Corbin Hoenes [mailto:[email protected]]
>>>> Sent: Thursday, May 06, 2010 10:29 AM
>>>> To: [email protected]
>>>> Subject: SpillableMemoryManager - low memory handler called
>>>> 
>>>> Hi Piggers - Seeing an issue with a particular script where our job is
>>>> taking 6hrs 42min to complete.
>>>> 
>>>> syslogs are showing loads of these:
>>>> INFO : org.apache.pig.impl.util.SpillableMemoryManager - low memory
>>>> handler called (Usage threshold exceeded) init = 5439488(5312K) used =
>>>> 283443200(276800K) committed = 357957632(349568K) max =
>>>> 357957632(349568K)
>>>> INFO : org.apache.pig.impl.util.SpillableMemoryManager - low memory
>>>> handler called (Usage threshold exceeded) init = 5439488(5312K) used =
>>>> 267128840(260868K) committed = 357957632(349568K) max =
>>>> 357957632(349568K)
>>>> One interesting thing is that it's the map phase that is slow, and one of
>>>> the mappers is getting 8GB of input while the other 2000 or so mappers are
>>>> getting MBs to hundreds of MBs of data.
>>>> 
>>>> Anywhere I can start looking?
>>>> 
>>>> 
>> 
> 
