[jira] Commented: (HADOOP-4296) Spasm of JobClient failures on successful jobs every once in a while

Vinod K V (JIRA) Fri, 17 Oct 2008 11:56:10 -0700

    [ https://issues.apache.org/jira/browse/HADOOP-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12640625#action_12640625 ]


Vinod K V commented on HADOOP-4296:
-----------------------------------

Dhruba,

Sorry for not looking at the entire patch earlier. The place where you are 
using MIN_TIME_BEFORE_RETIRE in finalizeJob to prevent retiring very recently 
finished jobs seems wrong to me. You should check for the condition before the 
job is even removed from JT.

Further, are you sure we want to throw an exception in the job-client? An 
exception would make the client exit with a non-zero exit code; I'd rather log 
the message and just break out of the loop, thus returning a zero exit code, 
as we know for sure that the job is indeed complete.

Having said that, I don't think introducing MIN_TIME_BEFORE_RETIRE here, in 
this blocker, is right. It is a feature. The latter fix, patching the client, 
is the correct and sufficient fix for the problem that Joydeep initially 
reported.
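
To illustrate the client-side handling suggested above, here is a minimal 
sketch, not the actual JobClient code. Every name in it (RetiredJobPollSketch, 
StatusSource, reducePercent, monitor) is hypothetical, and it assumes that a 
null status means the job has already been retired from the JT after 
finishing. A real client would also sleep between polls.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical sketch; none of these names come from the Hadoop code base.
class RetiredJobPollSketch {

    /** Stand-in for a status lookup against the JT. */
    interface StatusSource {
        // Assumption: null means the JT no longer knows the job (it was retired).
        Integer reducePercent();
    }

    /** Polls until the job completes or disappears; returns the exit code. */
    static int monitor(StatusSource source) {
        while (true) {
            Integer pct = source.reducePercent();
            if (pct == null) {
                // Job vanished between two polls: log and stop polling
                // instead of throwing, so the client still exits with zero.
                System.out.println("Job retired by the JT; treating it as complete");
                break;
            }
            System.out.println("map 100% reduce " + pct + "%");
            if (pct >= 100) {
                break;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        // Simulated poll results: 98%, 99%, then the job is gone from the JT.
        Iterator<Integer> polls = Arrays.asList(98, 99, null).iterator();
        System.out.println("exit code " + monitor(polls::next));
    }
}
```

The point of the sketch is only the control flow: the missing status is 
logged and ends the loop, rather than surfacing as an exception.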

> Spasm of JobClient failures on successful jobs every once in a while
> --------------------------------------------------------------------
>
>                 Key: HADOOP-4296
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4296
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Joydeep Sen Sarma
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: 4296_jt_delayretire.patch, 4296_jt_delayretire2.patch, 
> 4296_jt_delayretire3.patch
>
>
> At very busy times we get a wave of job client failures all at the same 
> time. The failures come when the job is about to complete. When we look at 
> the job history files, the jobs are actually complete. Here's the stack:
> 08/09/27 02:18:00 INFO mapred.JobClient:  map 100% reduce 98%
> 08/09/27 02:18:41 INFO mapred.JobClient:  map 100% reduce 99% 
> java.lang.NullPointerException
>       at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:993)
>       at com.facebook.hive.common.columnSetLoader.main(columnSetLoader.java:535)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>       at java.lang.reflect.Method.invoke(Method.java:597)
>       at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
