Prabhu Joseph created TEZ-3878:
----------------------------------
Summary: Fetcher does not log the actual error message thrown by
ShuffleHandler
Key: TEZ-3878
URL: https://issues.apache.org/jira/browse/TEZ-3878
Project: Apache Tez
Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Prabhu Joseph
A job is failing with reduce tasks failed to fetch map output and the
NodeManager ShuffleHandler failed to serve the map outputs with some
IOException like below. ShuffleHandler sends the actual error message in
response inside sendError() but the Fetcher does not log this message.
Logs from NodeManager ShuffleHandler:
{code}
2017-12-06 17:11:11,040 ERROR mapred.ShuffleHandler
(ShuffleHandler.java:messageReceived(962)) - Shuffle error in populating
headers :
java.io.IOException: Error Reading IndexFile
at
org.apache.hadoop.mapred.IndexCache.readIndexFileToCache(IndexCache.java:123)
at
org.apache.hadoop.mapred.IndexCache.getIndexInformation(IndexCache.java:68)
at
org.apache.hadoop.mapred.ShuffleHandler$Shuffle.getMapOutputInfo(ShuffleHandler.java:1066)
at
org.apache.hadoop.mapred.ShuffleHandler$Shuffle.populateHeaders(ShuffleHandler.java:1086)
at
org.apache.hadoop.mapred.ShuffleHandler$Shuffle.messageReceived(ShuffleHandler.java:958)
at
org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at
org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:560)
at
org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:787)
at
org.jboss.netty.handler.stream.ChunkedWriteHandler.handleUpstream(ChunkedWriteHandler.java:142)
at
org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:560)
at
org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:787)
at
org.jboss.netty.handler.codec.http.HttpChunkAggregator.messageReceived(HttpChunkAggregator.java:148)
at
org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at
org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:560)
at
org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:787)
at
org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
at
org.jboss.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:459)
at
org.jboss.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:536)
at
org.jboss.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:435)
at
org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at
org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:560)
at
org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:555)
at
org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
at
org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
at
org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:107)
at
org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312)
at
org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:88)
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at
org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at
org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Owner 'hbase' for path
/grid/7/hadoop/yarn/local/usercache/bde/appcache/application_1512457770852_9447/output/attempt_1512457770852_9447_1_01_000007_0_10004/file.out.index
did not match expected owner 'bde'
at org.apache.hadoop.io.SecureIOUtils.checkStat(SecureIOUtils.java:285)
at
org.apache.hadoop.io.SecureIOUtils.forceSecureOpenFSDataInputStream(SecureIOUtils.java:174)
at
org.apache.hadoop.io.SecureIOUtils.openFSDataInputStream(SecureIOUtils.java:158)
at org.apache.hadoop.mapred.SpillRecord.<init>(SpillRecord.java:70)
at org.apache.hadoop.mapred.SpillRecord.<init>(SpillRecord.java:62)
at
org.apache.hadoop.mapred.IndexCache.readIndexFileToCache(IndexCache.java:119)
{code}
Fetcher logs below without the actual error message:
{code}
7-12-07 00:34:01,372 [ERROR] [ShuffleAndMergeRunner {Map_1}]
|orderedgrouped.Shuffle|: Map_1: ShuffleRunner failed with error
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError:
error in shuffle in fetcher {Map_1} #2
at
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:357)
at
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:334)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Map_1: Shuffle failed with too many fetch
failures and insufficient progress!failureCounts=1, pendingInputs=1,
fetcherHealthy=false, reducerProgressedEnough=true, reducerStalled=true
at
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.isShuffleHealthy(ShuffleScheduler.java:811)
at
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.copyFailed(ShuffleScheduler.java:552)
at
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:321)
at
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:176)
at
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.run(FetcherOrderedGrouped.java:191)
{code}
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)