Thanks Harsh for a descriptive response. This means that all mappers would finish before we can find out if there were errors, right? Even though the first mapper might have reached this threshold.
Thanks,

Sent from my iPhone

On Nov 15, 2011, at 9:21 PM, Harsh J <ha...@cloudera.com> wrote:

> Ah so the threshold is job-level, not per task. OK.
>
> One other way I think would be performant, AND still able to use Hadoop itself, would be to keep one reducer for this job, and have that reducer check if the counter of total failed records exceeds the threshold or not. A reducer is guaranteed to have gotten the total aggregate of map side counters since it begins only after all maps complete. The reducer can then go ahead and fail itself to fail the job, or pass through. Your maps may output their data directly - the reducer is just to decide if the mappers were alright (perhaps send failed counts as KV to the reducer, to avoid looking up Hadoop counters from within tasks -- but this would easily apply only to map-only jobs. For MR jobs, it may be a bit more complicated to add this in, but surely still doable with some partitioner and comparator tweaks).
>
> But it is also good to fail if a single map task itself exceeds 10. The above is to ensure the global check, while doing this as well would ensure faster failure depending on the situation.
>
> On 16-Nov-2011, at 1:16 AM, Mapred Learn wrote:
>
>> Hi Harsh,
>>
>> My situation is to kill a job when this threshold is reached. If, say, the threshold is 10 and 2 mappers combined reached this value, how should I achieve this?
>>
>> With what you are saying, I think the job will fail once a single mapper reaches that threshold.
>>
>> Thanks,
>>
>> On Tue, Nov 15, 2011 at 11:22 AM, Harsh J <ha...@cloudera.com> wrote:
>> Mapred,
>>
>> If you fail a task permanently upon encountering a bad situation, you basically end up failing the job as well, automatically. By controlling the number of retries (say, down to 1 or 2 from the default of 4 total attempts), you can also have it fail the job faster.
>>
>> Is killing the job immediately a necessity? Why?
>>
>> I s'pose you could call kill from within the mapper, but I've never seen that as necessary in any situation so far. What's wrong with letting the job auto-die as a result of a failing task?
>>
>> On 16-Nov-2011, at 12:38 AM, Mapred Learn wrote:
>>
>>> Thanks David for a step-by-step response, but this makes the error threshold a per-mapper threshold. Is there a way to make it per job so that all mappers share this value and increment it as a shared counter?
>>>
>>> On Tue, Nov 15, 2011 at 8:12 AM, David Rosenstrauch <dar...@darose.net> wrote:
>>> On 11/14/2011 06:06 PM, Mapred Learn wrote:
>>> Hi,
>>>
>>> I have a use case where I want to pass a threshold value to a map-reduce job. For example: error records = 10.
>>>
>>> I want the map-reduce job to fail if the total count of error_records in the job, i.e. across all mappers, reaches this threshold.
>>>
>>> How can I implement this considering that each mapper would be processing some part of the input data?
>>>
>>> Thanks,
>>> -JJ
>>>
>>> 1) Pass in the threshold value as a configuration value of the M/R job. (i.e., job.getConfiguration().setInt("error_threshold", 10) )
>>>
>>> 2) Make your mappers implement the Configurable interface. This will ensure that every mapper gets passed a copy of the config object.
>>>
>>> 3) When you implement the setConf() method in your mapper (which Configurable will force you to do), retrieve the threshold value from the config and save it in an instance variable in the mapper. (i.e., int errorThreshold = conf.getInt("error_threshold", 10) )
>>>
>>> 4) In the mapper, when an error record occurs, increment a counter and then check if the counter value exceeds the threshold. If so, throw an exception. (e.g., if (++numErrors >= errorThreshold) throw new RuntimeException("Too many errors") )
>>>
>>> The exception will kill the mapper. Hadoop will attempt to re-run it, but subsequent attempts will also fail for the same reason, and eventually the entire job will fail.
>>>
>>> HTH,
>>>
>>> DR
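For reference, here is a minimal sketch of David's per-mapper steps 1-4, written against the newer org.apache.hadoop.mapreduce API, where setup(Context) plays the role of setConf(); the class name, output types, and the isErrorRecord() helper are hypothetical placeholders, not part of the thread.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper following David's steps: the driver sets
// job.getConfiguration().setInt("error_threshold", 10), each task reads the
// value back in setup(), counts its own bad records, and throws once its
// local count reaches the threshold.
public class ErrorThresholdMapper extends Mapper<LongWritable, Text, Text, Text> {

  private int errorThreshold;
  private int numErrors = 0;

  @Override
  protected void setup(Context context) {
    // Steps 2-3: with the new API, setup() plays the role of setConf().
    errorThreshold = context.getConfiguration().getInt("error_threshold", 10);
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    if (isErrorRecord(value)) {
      // Step 4: count the bad record and fail this attempt at the threshold.
      if (++numErrors >= errorThreshold) {
        throw new RuntimeException("Too many error records: " + numErrors);
      }
      return;
    }
    context.write(new Text("ok"), value);   // normal processing goes here
  }

  // Placeholder for whatever marks a record as bad in the real job.
  private boolean isErrorRecord(Text value) {
    return value.getLength() == 0;
  }
}

Because each task keeps its own numErrors, this by itself gives the per-mapper behaviour the thread is discussing, not a job-wide limit.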
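And a sketch of the job-wide check Harsh describes, assuming the driver calls job.setNumReduceTasks(1) and each mapper emits its failed-record count as a key/value pair under one shared key (e.g. from cleanup()); the class name and key are made up for illustration.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical single reducer for the global check: it runs only after all
// maps complete, sums the per-mapper failed counts, and fails itself (and
// therefore, after its retries, the whole job) if the total crosses the
// threshold.
public class ThresholdCheckReducer
    extends Reducer<Text, LongWritable, NullWritable, NullWritable> {

  @Override
  protected void reduce(Text key, Iterable<LongWritable> counts, Context context)
      throws IOException, InterruptedException {
    long threshold = context.getConfiguration().getLong("error_threshold", 10L);

    long totalErrors = 0;
    for (LongWritable count : counts) {
      totalErrors += count.get();
    }

    if (totalErrors >= threshold) {
      throw new IOException("Job-wide error records (" + totalErrors
          + ") reached the threshold of " + threshold);
    }
  }
}

As Harsh notes, this is simplest when the maps write their real output directly and the only shuffled data is the counts; folding it into a full MR job would need the partitioner/comparator tweaks he mentions.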