[ 
https://issues.apache.org/jira/browse/PIG-685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693532#action_12693532
 ] 

Tamir Kamara edited comment on PIG-685 at 3/29/09 2:13 AM:
-----------------------------------------------------------

I've replaced 1000 with 10 in the Distinct.java file (lines 129 & 148), but 
mappers are still failing because they fail to report for 600 seconds. There's 
also a heap space error on some mappers (same as before).
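
For readers without the source at hand, the sketch below shows the general 
pattern the change above appears to tune: signal progress once every N input 
records while collecting distinct values. It is not the code from Pig's 
Distinct.java. ProgressCallback, DistinctProgressSketch and countDistinct are 
hypothetical names, and it assumes the constant changed from 1000 to 10 is a 
progress-reporting interval. Note the null check: if no reporter object is 
passed in, as the WARN lines in the issue description suggest, a smaller 
interval alone would not produce any progress calls.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a framework progress callback.
// This is NOT Pig's or Hadoop's actual reporter interface.
interface ProgressCallback {
    void progress();
}

class DistinctProgressSketch {

    // Counts distinct values and calls reporter.progress() once every
    // `interval` input records. A null reporter is tolerated and simply
    // means no progress is ever signalled, whatever the interval is.
    static int countDistinct(Iterable<String> values, int interval,
                             ProgressCallback reporter) {
        if (interval <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        Set<String> seen = new HashSet<>();
        long processed = 0;
        for (String value : values) {
            seen.add(value);   // the whole set is held in memory
            processed++;
            if (reporter != null && processed % interval == 0) {
                reporter.progress();
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        List<String> domains =
            Arrays.asList("a.com", "b.com", "a.com", "c.com", "b.com");
        int[] calls = {0};
        int distinct = countDistinct(domains, 2, () -> calls[0]++);
        System.out.println(distinct + " distinct, " + calls[0] + " progress calls");
    }
}
```

Because the set of seen values lives entirely in memory, a sketch like this 
would also run out of heap for a very large group, independently of how often 
progress is reported.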

By the way, if I use the same script with no COUNT (i.e. GENERATE group as org, 
r6), all the mappers finish just fine, but the reducers fail with a GC overhead 
limit exceeded error. 
I'm running my tasks with 1024MB.


> Distinct UDF progress reports
> -----------------------------
>
>                 Key: PIG-685
>                 URL: https://issues.apache.org/jira/browse/PIG-685
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.2.0
>         Environment: Hadoop 0.18.3 on redhat, PIG svn from feb-01
>            Reporter: Tamir Kamara
>
> When using the DISTINCT function, many of the map tasks are being killed 
> because they fail to report for 600 seconds. It seems that PIG-646 should 
> have addressed this, but I'm still seeing many errors like this:
> 2009-02-21 11:41:53,916 INFO org.apache.hadoop.mapred.MapTask: Starting flush of map output
> 2009-02-21 11:41:57,727 WARN org.apache.pig.builtin.Distinct$Intermediate: No reporter object provided to UDF org.apache.pig.builtin.Distinct$Intermediate
> 2009-02-21 11:41:57,730 WARN org.apache.pig.builtin.Distinct$Intermediate: No reporter object provided to UDF org.apache.pig.builtin.Distinct$Intermediate
> My query:
> r0 = load 'domain-org/*' as (domain:chararray, org:chararray);
> r3 = GROUP r0 BY org parallel 18;
> r4 = FOREACH r3 {
>        r5 = r0.domain;
>        r6 = distinct r5;
>        GENERATE group as org, COUNT(r6) as domains;
> }
> store r4 into 'org-domain-count';
> The source files are 21GB in total, with some 800M lines, 60M distinct 
> domains and 80K distinct orgs. Some orgs have 50M domains in them.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
