[ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893446#action_12893446 ]
Olga Natkovich commented on PIG-1249:
-------------------------------------

Comments for the documentation:

+ /**
+  * Currently, reducer-number estimation is applied only to HDFS; the estimate is based on the size of the input data stored on HDFS.
+  * Two parameters can be configured for the estimation. One is pig.exec.reducers.max, which caps the number of reduce tasks (default 999). The other
+  * is pig.exec.reducers.bytes.per.reducer (default 1000*1000*1000), which specifies how much data each reducer should handle.
+  * e.g. given the following Pig script:
+  * a = load '/data/a';
+  * b = load '/data/b';
+  * c = join a by $0, b by $0;
+  * store c into '/tmp';
+  *
+  * If the size of /data/a is 1000*1000*1000 and the size of /data/b is 2*1000*1000*1000,
+  * then the estimated reducer number is (1000*1000*1000 + 2*1000*1000*1000) / (1000*1000*1000) = 3.

> Safe-guards against misconfigured Pig scripts without PARALLEL keyword
> ----------------------------------------------------------------------
>
>                 Key: PIG-1249
>                 URL: https://issues.apache.org/jira/browse/PIG-1249
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Arun C Murthy
>            Assignee: Jeff Zhang
>            Priority: Critical
>             Fix For: 0.8.0
>
>         Attachments: PIG-1249-4.patch, PIG-1249.patch, PIG-1249_5.patch, PIG_1249_2.patch, PIG_1249_3.patch
>
>
> It would be *very* useful for Pig to have safe-guards against naive scripts
> which process a *lot* of data without the use of the PARALLEL keyword.
> We've seen a fair number of instances where naive users process huge
> data-sets (>10TB) with a badly mis-configured #reduces, e.g. 1 reduce.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
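The estimation rule described in the documentation comment above can be sketched as follows. This is a minimal illustration of the arithmetic only, not the actual patch code: the class and method names are hypothetical, and the exact rounding/clamping in the PIG-1249 implementation may differ.

```java
// Hypothetical sketch of the reducer estimation described in the doc comment.
// Assumes: one reducer per pig.exec.reducers.bytes.per.reducer bytes of HDFS
// input, at least 1 reducer, capped at pig.exec.reducers.max.
public class ReducerEstimator {
    // Defaults quoted in the documentation comment above.
    static final long BYTES_PER_REDUCER = 1000L * 1000 * 1000; // pig.exec.reducers.bytes.per.reducer
    static final int MAX_REDUCERS = 999;                       // pig.exec.reducers.max

    static int estimateReducers(long totalInputBytes) {
        // Divide total input size by the per-reducer quota, rounding up,
        // then clamp to the [1, MAX_REDUCERS] range.
        int estimate = (int) Math.ceil((double) totalInputBytes / BYTES_PER_REDUCER);
        return Math.min(MAX_REDUCERS, Math.max(1, estimate));
    }

    public static void main(String[] args) {
        long sizeA = 1000L * 1000 * 1000;     // size of /data/a in the example
        long sizeB = 2L * 1000 * 1000 * 1000; // size of /data/b in the example
        // (1G + 2G) / 1G = 3, matching the worked example above.
        System.out.println(estimateReducers(sizeA + sizeB));
    }
}
```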