[ 
https://issues.apache.org/jira/browse/TEZ-3878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated TEZ-3878:
-------------------------------
    Labels: supportability  (was: )

> Fetcher does not log the actual error message thrown by ShuffleHandler
> ----------------------------------------------------------------------
>
>                 Key: TEZ-3878
>                 URL: https://issues.apache.org/jira/browse/TEZ-3878
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Prabhu Joseph
>              Labels: supportability
>
> A job is failing with reduce tasks failed to fetch map output and the 
> NodeManager ShuffleHandler failed to serve the map outputs with some 
> IOException like below. ShuffleHandler sends the actual error message in 
> response inside sendError() but the Fetcher does not log this message.
> Logs from NodeManager ShuffleHandler:
> {code}
> 2017-12-06 17:11:11,040 ERROR mapred.ShuffleHandler 
> (ShuffleHandler.java:messageReceived(962)) - Shuffle error in populating 
> headers :
> java.io.IOException: Error Reading IndexFile
>         at 
> org.apache.hadoop.mapred.IndexCache.readIndexFileToCache(IndexCache.java:123)
>         at 
> org.apache.hadoop.mapred.IndexCache.getIndexInformation(IndexCache.java:68)
>         at 
> org.apache.hadoop.mapred.ShuffleHandler$Shuffle.getMapOutputInfo(ShuffleHandler.java:1066)
>         at 
> org.apache.hadoop.mapred.ShuffleHandler$Shuffle.populateHeaders(ShuffleHandler.java:1086)
>         at 
> org.apache.hadoop.mapred.ShuffleHandler$Shuffle.messageReceived(ShuffleHandler.java:958)
>         at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
>         at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:560)
>         at 
> org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:787)
>         at 
> org.jboss.netty.handler.stream.ChunkedWriteHandler.handleUpstream(ChunkedWriteHandler.java:142)
>         at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:560)
>         at 
> org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:787)
>         at 
> org.jboss.netty.handler.codec.http.HttpChunkAggregator.messageReceived(HttpChunkAggregator.java:148)
>         at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
>         at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:560)
>         at 
> org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:787)
>         at 
> org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
>         at 
> org.jboss.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:459)
>         at 
> org.jboss.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:536)
>         at 
> org.jboss.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:435)
>         at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
>         at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:560)
>         at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:555)
>         at 
> org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
>         at 
> org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
>         at 
> org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
>         at 
> org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:107)
>         at 
> org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312)
>         at 
> org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:88)
>         at 
> org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
>         at 
> org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
>         at 
> org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Owner 'hbase' for path 
> /grid/7/hadoop/yarn/local/usercache/bde/appcache/application_1512457770852_9447/output/attempt_1512457770852_9447_1_01_000007_0_10004/file.out.index
>  did not match expected owner 'bde'
>         at 
> org.apache.hadoop.io.SecureIOUtils.checkStat(SecureIOUtils.java:285)
>         at 
> org.apache.hadoop.io.SecureIOUtils.forceSecureOpenFSDataInputStream(SecureIOUtils.java:174)
>         at 
> org.apache.hadoop.io.SecureIOUtils.openFSDataInputStream(SecureIOUtils.java:158)
>         at org.apache.hadoop.mapred.SpillRecord.<init>(SpillRecord.java:70)
>         at org.apache.hadoop.mapred.SpillRecord.<init>(SpillRecord.java:62)
>         at 
> org.apache.hadoop.mapred.IndexCache.readIndexFileToCache(IndexCache.java:119)
> {code}
> Fetcher logs below without the actual error message:
> {code}
> 7-12-07 00:34:01,372 [ERROR] [ShuffleAndMergeRunner {Map_1}] 
> |orderedgrouped.Shuffle|: Map_1: ShuffleRunner failed with error
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError:
>  error in shuffle in fetcher {Map_1} #2
>         at 
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:357)
>         at 
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:334)
>         at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Map_1: Shuffle failed with too many fetch 
> failures and insufficient progress!failureCounts=1, pendingInputs=1, 
> fetcherHealthy=false, reducerProgressedEnough=true, reducerStalled=true
>         at 
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.isShuffleHealthy(ShuffleScheduler.java:811)
>         at 
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.copyFailed(ShuffleScheduler.java:552)
>         at 
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:321)
>         at 
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:176)
>         at 
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.run(FetcherOrderedGrouped.java:191)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to