[
https://issues.apache.org/jira/browse/MAPREDUCE-6762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427590#comment-15427590
]
Weiwei Yang commented on MAPREDUCE-6762:
----------------------------------------
The error we saw from the Pig console:
{code}
2016-07-20 07:28:13,625 [uber-SubtaskRunner] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2016-07-20 07:28:16,252 [JobControl] ERROR org.apache.pig.backend.hadoop23.PigJobControl - Error while trying to run jobs.
java.lang.NullPointerException
    at org.apache.hadoop.mapreduce.Job.getJobName(Job.java:426)
    at org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob.toString(ControlledJob.java:93)
    at java.lang.String.valueOf(String.java:2982)
    at java.lang.StringBuilder.append(StringBuilder.java:131)
    at org.apache.pig.backend.hadoop23.PigJobControl.run(PigJobControl.java:182)
    at java.lang.Thread.run(Thread.java:745)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$1.run(MapReduceLauncher.java:276)
{code}
The error we saw from the app-master log (indicating that the failure happened while writing the job meta files):
{code}
2016-08-10 07:46:54,862 INFO [Thread-1245] org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.mapreduce.v2.app.MRAppMaster failed in state STOPPED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.87.225.170:40913 remote=/10.87.225.174:50010]
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.87.225.170:40913 remote=/10.87.225.174:50010]
    at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.handleEvent(JobHistoryEventHandler.java:580)
    at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.serviceStop(JobHistoryEventHandler.java:374)
    at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
    at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
    at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
    at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157)
    at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceStop(MRAppMaster.java:1626)
    at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.stop(MRAppMaster.java:1126)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.shutDownJob(MRAppMaster.java:561)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobFinishEventHandler$1.run(MRAppMaster.java:609)
Caused by: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.87.225.170:40913 remote=/10.87.225.174:50010]
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
    at java.io.FilterInputStream.read(FilterInputStream.java:83)
    at java.io.FilterInputStream.read(FilterInputStream.java:83)
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2278)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1020)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:990)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1131)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:876)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:402)
{code}
So the chain of events looks like this:
* The datanode was too busy to answer the JHS's request to flush the job meta files
* The job meta files are missing
* The job client fails to get a job status update
* {{Job.status}} is reset to null
* {{Job.getJobName}} fails with an NPE (see the sketch below)
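To make the failure mode concrete, here is a minimal, hypothetical sketch of a null-safe wrapper around {{Job.getJobName}}. The class and method names are mine, not part of any actual patch; it only illustrates where the NPE comes from and one way a caller such as {{ControlledJob#toString}} could degrade gracefully instead of failing the whole run.
{code}
import org.apache.hadoop.mapreduce.Job;

/**
 * Sketch only: wrap the status-dependent call so that a job whose status
 * could not be refreshed still yields a readable string instead of letting
 * the NullPointerException escape from ControlledJob#toString.
 */
public class SafeJobDescription {

  /** Returns the job name, or a placeholder when the status is unavailable. */
  public static String describe(Job job) {
    try {
      // Per the stack trace above, Job.getJobName() reads the cached JobStatus
      // and throws NullPointerException when that status is null.
      return job.getJobName();
    } catch (NullPointerException e) {
      return "<job name unavailable: status not updated>";
    }
  }
}
{code}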
> ControlledJob#toString failed with NPE when job status is not successfully updated
> ----------------------------------------------------------------------------------
>
> Key: MAPREDUCE-6762
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6762
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Affects Versions: 2.7.2
> Reporter: Weiwei Yang
>
> This issue was found on a cluster where Pig queries occasionally failed with an
> NPE. Pig uses the JobControl API to track MR job status, but sometimes the Job
> History Server failed to flush the job meta files to HDFS, which caused the
> status update to fail. We then get an NPE in
> {{org.apache.hadoop.mapreduce.Job.getJobName}}. The result of this situation is
> quite confusing: the Pig query failed and the job history is missing, but the
> job status on YARN is SUCCEEDED.
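For context, below is a rough, hypothetical sketch of the JobControl polling pattern that Pig relies on; the group and job names are placeholders, and Pig's own {{PigJobControl}} adds more logic on top of this. The NPE surfaces when a {{ControlledJob}} is rendered to a string (e.g. for a log message) while the underlying {{Job}}'s status could not be refreshed.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

public class JobControlPollingSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "example-job");           // placeholder job setup
    ControlledJob controlled = new ControlledJob(job, null);  // no dependent jobs

    JobControl group = new JobControl("example-group");
    group.addJob(controlled);

    // JobControl implements Runnable; Pig's PigJobControl drives it similarly.
    Thread runner = new Thread(group, "JobControl");
    runner.start();

    while (!group.allFinished()) {
      // Building a status/log message from the ControlledJob calls its
      // toString(), which in turn calls Job.getJobName() -- the NPE site here.
      System.out.println(controlled);
      Thread.sleep(5000);
    }
    group.stop();
  }
}
{code}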