[
https://issues.apache.org/jira/browse/TEZ-3878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Prabhu Joseph updated TEZ-3878:
-------------------------------
Labels: supportability (was: )
> Fetcher does not log the actual error message thrown by ShuffleHandler
> ----------------------------------------------------------------------
>
> Key: TEZ-3878
> URL: https://issues.apache.org/jira/browse/TEZ-3878
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.7.0
> Reporter: Prabhu Joseph
> Labels: supportability
>
> A job is failing with reduce tasks failed to fetch map output and the
> NodeManager ShuffleHandler failed to serve the map outputs with some
> IOException like below. ShuffleHandler sends the actual error message in
> response inside sendError() but the Fetcher does not log this message.
> Logs from NodeManager ShuffleHandler:
> {code}
> 2017-12-06 17:11:11,040 ERROR mapred.ShuffleHandler
> (ShuffleHandler.java:messageReceived(962)) - Shuffle error in populating
> headers :
> java.io.IOException: Error Reading IndexFile
> at
> org.apache.hadoop.mapred.IndexCache.readIndexFileToCache(IndexCache.java:123)
> at
> org.apache.hadoop.mapred.IndexCache.getIndexInformation(IndexCache.java:68)
> at
> org.apache.hadoop.mapred.ShuffleHandler$Shuffle.getMapOutputInfo(ShuffleHandler.java:1066)
> at
> org.apache.hadoop.mapred.ShuffleHandler$Shuffle.populateHeaders(ShuffleHandler.java:1086)
> at
> org.apache.hadoop.mapred.ShuffleHandler$Shuffle.messageReceived(ShuffleHandler.java:958)
> at
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
> at
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:560)
> at
> org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:787)
> at
> org.jboss.netty.handler.stream.ChunkedWriteHandler.handleUpstream(ChunkedWriteHandler.java:142)
> at
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:560)
> at
> org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:787)
> at
> org.jboss.netty.handler.codec.http.HttpChunkAggregator.messageReceived(HttpChunkAggregator.java:148)
> at
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
> at
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:560)
> at
> org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:787)
> at
> org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
> at
> org.jboss.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:459)
> at
> org.jboss.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:536)
> at
> org.jboss.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:435)
> at
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
> at
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:560)
> at
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:555)
> at
> org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
> at
> org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
> at
> org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
> at
> org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:107)
> at
> org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312)
> at
> org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:88)
> at
> org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
> at
> org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
> at
> org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Owner 'hbase' for path
> /grid/7/hadoop/yarn/local/usercache/bde/appcache/application_1512457770852_9447/output/attempt_1512457770852_9447_1_01_000007_0_10004/file.out.index
> did not match expected owner 'bde'
> at
> org.apache.hadoop.io.SecureIOUtils.checkStat(SecureIOUtils.java:285)
> at
> org.apache.hadoop.io.SecureIOUtils.forceSecureOpenFSDataInputStream(SecureIOUtils.java:174)
> at
> org.apache.hadoop.io.SecureIOUtils.openFSDataInputStream(SecureIOUtils.java:158)
> at org.apache.hadoop.mapred.SpillRecord.<init>(SpillRecord.java:70)
> at org.apache.hadoop.mapred.SpillRecord.<init>(SpillRecord.java:62)
> at
> org.apache.hadoop.mapred.IndexCache.readIndexFileToCache(IndexCache.java:119)
> {code}
> Fetcher logs below without the actual error message:
> {code}
> 7-12-07 00:34:01,372 [ERROR] [ShuffleAndMergeRunner {Map_1}]
> |orderedgrouped.Shuffle|: Map_1: ShuffleRunner failed with error
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError:
> error in shuffle in fetcher {Map_1} #2
> at
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:357)
> at
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:334)
> at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Map_1: Shuffle failed with too many fetch
> failures and insufficient progress!failureCounts=1, pendingInputs=1,
> fetcherHealthy=false, reducerProgressedEnough=true, reducerStalled=true
> at
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.isShuffleHealthy(ShuffleScheduler.java:811)
> at
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.copyFailed(ShuffleScheduler.java:552)
> at
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:321)
> at
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:176)
> at
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.run(FetcherOrderedGrouped.java:191)
> {code}
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)