[ 
https://issues.apache.org/jira/browse/PIG-685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12689540#action_12689540
 ] 

Santhosh Srinivasan commented on PIG-685:
-----------------------------------------

Currently, the Distinct UDF reports progress once every 1000 tuples. If each 
tuple is fairly large, processing 1000 of them could take longer than the 
600-second task timeout. The number 1000 was picked heuristically; we could 
reduce it to 100 or something in that range.
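
To make the trade-off concrete, here is a minimal sketch of interval-based 
progress reporting. It is not the actual Distinct code: the ProgressReporter 
interface, the distinct() helper and the interval parameter are hypothetical 
names used only for illustration, and the real UDF would call whatever 
reporter object Pig hands to it.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DistinctProgressSketch {

    /** Hypothetical stand-in for the reporter object handed to a UDF. */
    public interface ProgressReporter {
        void progress();
    }

    /**
     * Collects the distinct elements of the input and pings the reporter
     * once every `interval` tuples. A null reporter is tolerated, which
     * mirrors the "No reporter object provided" case in the logs.
     */
    public static <T> Set<T> distinct(Iterable<T> tuples,
                                      ProgressReporter reporter,
                                      int interval) {
        Set<T> seen = new HashSet<>();
        long processed = 0;
        for (T tuple : tuples) {
            seen.add(tuple);
            processed++;
            // A smaller interval means more frequent reports, at the cost
            // of more calls into the reporter.
            if (reporter != null && processed % interval == 0) {
                reporter.progress();
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        List<Integer> tuples = new ArrayList<>();
        for (int i = 0; i < 250; i++) {
            tuples.add(i % 50); // 250 tuples, 50 distinct values
        }
        int[] calls = {0};
        Set<Integer> result = distinct(tuples, () -> calls[0]++, 100);
        System.out.println("distinct=" + result.size()
                + " progressCalls=" + calls[0]);
    }
}
```

With an interval of 100 instead of 1000, the reporter is called ten times as 
often for the same input, so a bag of large tuples is less likely to go a 
full timeout period without reporting. A count-based interval still gives no 
hard guarantee, since the time per tuple is unbounded.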

Any other thoughts?

> Distinct UDF progress reports
> -----------------------------
>
>                 Key: PIG-685
>                 URL: https://issues.apache.org/jira/browse/PIG-685
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 1.0.0
>         Environment: Hadoop 0.18.3 on redhat, PIG svn from feb-01
>            Reporter: Tamir Kamara
>
> When using the DISTINCT function many of the map tasks are being killed 
> because of failure to report for 600 seconds. It seems that PIG-646 should 
> have addressed this but I'm still seeing many errors like this:
> 2009-02-21 11:41:53,916 INFO org.apache.hadoop.mapred.MapTask: Starting flush 
> of map output
> 2009-02-21 11:41:57,727 WARN org.apache.pig.builtin.Distinct$Intermediate: No 
> reporter object provided to UDF org.apache.pig.builtin.Distinct$Intermediate
> 2009-02-21 11:41:57,730 WARN org.apache.pig.builtin.Distinct$Intermediate: No 
> reporter object provided to UDF org.apache.pig.builtin.Distinct$Intermediate
> My query:
> r0 = load 'domain-org/*' as (domain:chararray, org:chararray);
> r3 = GROUP r0 BY org parallel 18;
> r4 = FOREACH r3 {
>        r5 = r0.domain;
>        r6 = distinct r5;
>        GENERATE group as org, COUNT(r6) as domains;
> }
> store r4 into 'org-domain-count';
> the source files are 21GB in total with some 800M lines, 60M distinct domains 
> and 80K distinct orgs. Some orgs have 50M domains in them.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
