[ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893446#action_12893446 ]
Olga Natkovich commented on PIG-1249:
-------------------------------------

Comments for the documentation:

+ /**
+  * Currently, reducer-number estimation is applied only to HDFS; the estimate is based on the size of the input data stored on HDFS.
+  * Two parameters can be configured for the estimation. One is pig.exec.reducers.max, which caps the number of reduce tasks (default 999). The other
+  * is pig.exec.reducers.bytes.per.reducer (default 1000*1000*1000), which specifies how much data each reducer should handle.
+  * e.g. given the following Pig script:
+  * a = load '/data/a';
+  * b = load '/data/b';
+  * c = join a by $0, b by $0;
+  * store c into '/tmp';
+  *
+  * If the size of /data/a is 1000*1000*1000 and the size of /data/b is 2*1000*1000*1000,
+  * then the estimated reducer number is (1000*1000*1000 + 2*1000*1000*1000) / (1000*1000*1000) = 3.

> Safe-guards against misconfigured Pig scripts without PARALLEL keyword
> ----------------------------------------------------------------------
>
>                 Key: PIG-1249
>                 URL: https://issues.apache.org/jira/browse/PIG-1249
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Arun C Murthy
>            Assignee: Jeff Zhang
>            Priority: Critical
>             Fix For: 0.8.0
>
>         Attachments: PIG-1249-4.patch, PIG-1249.patch, PIG-1249_5.patch, PIG_1249_2.patch, PIG_1249_3.patch
>
>
> It would be *very* useful for Pig to have safe-guards against naive scripts
> which process a *lot* of data without the use of the PARALLEL keyword.
> We've seen a fair number of instances where naive users process huge
> data-sets (>10TB) with a badly mis-configured #reduces, e.g. 1 reduce.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
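The estimation rule described in the documentation comment above can be sketched as follows. This is a minimal illustration of the arithmetic only, not the actual patch code: the class and method names are hypothetical, and the exact rounding/clamping in the PIG-1249 implementation may differ.

```java
// Hypothetical sketch of the reducer estimation described in the doc comment.
// Assumes: one reducer per pig.exec.reducers.bytes.per.reducer bytes of HDFS
// input, at least 1 reducer, capped at pig.exec.reducers.max.
public class ReducerEstimator {
    // Defaults quoted in the documentation comment above.
    static final long BYTES_PER_REDUCER = 1000L * 1000 * 1000; // pig.exec.reducers.bytes.per.reducer
    static final int MAX_REDUCERS = 999;                       // pig.exec.reducers.max

    static int estimateReducers(long totalInputBytes) {
        // Divide total input size by the per-reducer quota, rounding up,
        // then clamp to the [1, MAX_REDUCERS] range.
        int estimate = (int) Math.ceil((double) totalInputBytes / BYTES_PER_REDUCER);
        return Math.min(MAX_REDUCERS, Math.max(1, estimate));
    }

    public static void main(String[] args) {
        long sizeA = 1000L * 1000 * 1000;     // size of /data/a in the example
        long sizeB = 2L * 1000 * 1000 * 1000; // size of /data/b in the example
        // (1G + 2G) / 1G = 3, matching the worked example above.
        System.out.println(estimateReducers(sizeA + sizeB));
    }
}
```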