Rahul Jain created MAPREDUCE-4442:
-------------------------------------
Summary: Accessing hadoop counters from a job is unreliable in
yarn during in AM process cleanup window
Key: MAPREDUCE-4442
URL: https://issues.apache.org/jira/browse/MAPREDUCE-4442
Project: Hadoop Map/Reduce
Issue Type: Bug
Affects Versions: 2.0.0-alpha
Reporter: Rahul Jain
We found this issue during our tests moving from MapReduceV1 to MapReduceV2. A
few of our applications access job counters multiple times:
a) After submission of job, while job is execution (works fine)
b) Right after job complete notification is received (works fine)
c) Few seconds after job complete notification (fails most of the times).
The error snippet is as follows:
{code}
2012-07-12 19:12:29,039 WARN [Client] Unexpected error reading responses on
connection Thread[IPC Client (1252749669) connection to
sjc1-ciq-ibm-grid07.carrieriq.com/10.202.50.187:47944 from hadoop,5,main]
java.lang.NullPointerException
at
org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:852)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:781)
2012-07-12 19:12:29,044 INFO [ClientServiceDelegate] Application state is
completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2012-07-12 19:12:29,132 INFO [ClientServiceDelegate] Application state is
completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2012-07-12 19:12:29,216 ERROR [UserGroupInformation] PriviledgedActionException
as:hadoop (auth:SIMPLE) cause:java.io.IOException
2012-07-12 19:12:29,216 WARN [BaseOutputStageJob] getJobCounters: Unable to
retrieve counters. null
java.io.IOException
at
org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:315)
at
org.apache.hadoop.mapred.ClientServiceDelegate.getJobCounters(ClientServiceDelegate.java:335)
at
org.apache.hadoop.mapred.YARNRunner.getJobCounters(YARNRunner.java:470)
at org.apache.hadoop.mapreduce.Job$8.run(Job.java:719)
at org.apache.hadoop.mapreduce.Job$8.run(Job.java:716)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
at org.apache.hadoop.mapreduce.Job.getCounters(Job.java:716)
at
org.apache.hadoop.mapred.JobClient$NetworkedJob.getCounters(JobClient.java:396)
{code}
I'll attach the logs for AM and resource mgr separately, however no unusual
activity is seen in those.
This makes me suspect that we have a race condition in the code trying to
access job counters when AM is finishing up and the job hasn't moved to history
server yet.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira