[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitesh Shah updated MAPREDUCE-4442:
-----------------------------------

    Labels: usability  (was: )
    
> Accessing hadoop counters from a job is unreliable in yarn during AM process 
> cleanup  window
> --------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4442
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4442
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.0.0-alpha
>            Reporter: Rahul Jain
>              Labels: usability
>         Attachments: am_logs_counter_failure.html, 
> rsrc_mgr_logs_counter_failed.txt
>
>
> We found this issue during our tests moving from MapReduceV1 to MapReduceV2. 
> A few of our applications access job counters multiple times:
> a) After submission of job, while job is execution (works fine)
> b) Right after job complete notification is received (works fine)
> c) Few seconds after job complete notification (fails most of the time).
> The error snippet is as follows:
> {code}
> 2012-07-12 19:12:29,039 WARN  [Client] Unexpected error reading responses on 
> connection Thread[IPC Client (1252749669) connection to 
> sjc1-ciq-ibm-grid07.carrieriq.com/10.202.50.187:47944 from hadoop,5,main]
> java.lang.NullPointerException
>       at 
> org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:852)
>       at org.apache.hadoop.ipc.Client$Connection.run(Client.java:781)
> 2012-07-12 19:12:29,044 INFO  [ClientServiceDelegate] Application state is 
> completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
> 2012-07-12 19:12:29,132 INFO  [ClientServiceDelegate] Application state is 
> completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
> 2012-07-12 19:12:29,216 ERROR [UserGroupInformation] 
> PriviledgedActionException as:hadoop (auth:SIMPLE) cause:java.io.IOException
> 2012-07-12 19:12:29,216 WARN  [BaseOutputStageJob] getJobCounters: Unable to 
> retrieve counters. null
> java.io.IOException
>       at 
> org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:315)
>       at 
> org.apache.hadoop.mapred.ClientServiceDelegate.getJobCounters(ClientServiceDelegate.java:335)
>       at 
> org.apache.hadoop.mapred.YARNRunner.getJobCounters(YARNRunner.java:470)
>       at org.apache.hadoop.mapreduce.Job$8.run(Job.java:719)
>       at org.apache.hadoop.mapreduce.Job$8.run(Job.java:716)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:396)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
>       at org.apache.hadoop.mapreduce.Job.getCounters(Job.java:716)
>       at 
> org.apache.hadoop.mapred.JobClient$NetworkedJob.getCounters(JobClient.java:396)
> {code}
> The connection to 10.202.50.187:47944 is actually the connection to AM; 
> appears that we are connecting to AM to get the counters for the successful 
> job and not yet to the history server.
>  
> I'll attach the logs for AM and resource mgr separately, however no unusual 
> activity is seen in those.
> This makes me suspect that we have a race condition in the code trying to 
> access job counters when AM is finishing up and the job hasn't moved to 
> history server yet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to