Attila Sasvari created MAPREDUCE-7003:
-----------------------------------------
Summary: Indefinite retries of getJobSummary() if a job summary
file is corrupt
Key: MAPREDUCE-7003
URL: https://issues.apache.org/jira/browse/MAPREDUCE-7003
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: jobhistoryserver
Reporter: Attila Sasvari
A corrupt job summary file in the {{/user/history/done_intermediate}}
directory in HDFS, e.g.
{{/user/history/done_intermediate/oozie/job_1111111111111_111111.summary}},
that has not yet been moved to {{/user/history/done}} causes
{{org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.getJobSummary()}} to be
retried indefinitely. JHS logs recurring exceptions like:
{code}
2017-11-03 01:01:01,124 WARN org.apache.hadoop.hdfs.BlockReaderFactory: I/O error constructing remote block reader.
java.io.IOException: Got error for OP_READ_BLOCK, status=ERROR, self=/ABC.DEF.GHI:JKLMN, remote=/ABC.DEF.GHI:JKLMN, for file /user/history/done_intermediate/admin/job_1111111111111_1111.summary, for pool XX-999999999-ABC.DEF.GHI-1111111111111 block 1111111111_22222
    at org.apache.hadoop.hdfs.RemoteBlockReader2.checkSuccess(RemoteBlockReader2.java:467)
    at org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:432)
    at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:881)
    at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:759)
    at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:376)
    at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:652)
    at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:879)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:932)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:732)
    at java.io.DataInputStream.readUnsignedShort(DataInputStream.java:337)
    at java.io.DataInputStream.readUTF(DataInputStream.java:589)
    at java.io.DataInputStream.readUTF(DataInputStream.java:564)
    at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.getJobSummary(HistoryFileManager.java:1059)
{code}
(INFO and ERROR logs are omitted)
To reproduce it:
- start JHS in debug mode (use JVM parameter
{{-agentlib:jdwp=transport=dt_socket,server=y,address=45555,suspend=n}} when
starting it)
- attach a debugger to the process and add a breakpoint to stop in
{{org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.getJobSummary()}}
- start a MapReduce job and wait until the breakpoint is hit
- delete or rename the physical block on the datanode(s) for the job summary file
(e.g. use {{hdfs fsck
/user/history/done_intermediate/oozie/job_1111111111111_111111.summary -blocks
-locations -files}} to get the block name; search for the block on the
datanode(s) and remove/rename it)
- detach debugger
- examine JHS log files
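
The block lookup in the steps above could be scripted roughly as follows. This is a minimal sketch rather than a tested procedure: the helper name {{extract_block_names}} is introduced here for illustration, the datanode data directory {{/data/dfs/dn}} is a hypothetical path (check {{dfs.datanode.data.dir}} on your cluster), and the {{find}} has to be run on each datanode that holds a replica.

```shell
#!/usr/bin/env bash
# Sketch: locate the on-disk block files backing a job summary file.

# Hypothetical helper: reads `hdfs fsck ... -blocks -locations -files`
# output on stdin and prints each distinct block name (blk_<id>),
# without the generation stamp suffix.
extract_block_names() {
  grep -oE 'blk_[0-9]+' | sort -u
}

# Path taken from the report; replace with the summary file of your job.
SUMMARY=/user/history/done_intermediate/oozie/job_1111111111111_111111.summary
# Assumption: datanode data directory, adjust to your cluster.
DN_DATA_DIR=/data/dfs/dn

# Run the cluster-dependent part only when the hdfs CLI is available.
if command -v hdfs >/dev/null 2>&1; then
  hdfs fsck "$SUMMARY" -blocks -locations -files | extract_block_names |
  while read -r blk; do
    # Lists the block file and its .meta file. Review the matches,
    # then rename or remove them by hand.
    find "$DN_DATA_DIR" -name "${blk}*" 2>/dev/null
  done
fi
```

For example, feeding the helper an illustrative line such as {{0. BP-1-127.0.0.1-1:blk_1073741825_1001 len=120 repl=3}} prints {{blk_1073741825}}. The script only lists candidate files and deliberately leaves the rename or delete as a manual step, since that step corrupts data.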
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)