[ https://issues.apache.org/jira/browse/SPARK-17321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15474557#comment-15474557 ]
Thomas Graves commented on SPARK-17321:
---------------------------------------
So there are two possible cases here:
1) You are using YARN NM recovery. If so, SPARK-14963 should prevent this
problem: the recovery path is considered critical to the NM, and the NM should
not start if that path is bad.
2) You aren't using NM recovery. In that case you probably don't really care
about the levelDB being saved, because you aren't expecting anything to survive
across NM restarts. If people are hitting issues like this, perhaps we should
make the levelDB usage conditional on NM recovery or on a Spark config (a
sketch of that conditional follows below).
Which case are you running?
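To make case 2 concrete, here is a minimal sketch of what such a conditional could look like. The class and method names are illustrative stand-ins, not the actual YarnShuffleService code; only the Hadoop configuration keys are real.
{code:java}
import java.io.File;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ShuffleServiceRecoverySketch {
  // Illustrative stand-in for YarnShuffleService's levelDB setup.
  static File initLevelDbIfRecoveryEnabled(Configuration conf, File baseDir) {
    boolean recoveryEnabled = conf.getBoolean(
        YarnConfiguration.NM_RECOVERY_ENABLED,
        YarnConfiguration.DEFAULT_NM_RECOVERY_ENABLED);
    if (!recoveryEnabled) {
      // Case 2: no state needs to survive an NM restart, so skip the
      // levelDB file entirely; a bad disk then can't break initialization.
      return null;
    }
    // Case 1: recovery is on, so the DB belongs on the NM recovery path,
    // which the NM itself validates at startup (see SPARK-14963).
    return new File(baseDir, "registeredExecutors.ldb");
  }
}
{code}
With something along these lines, the no-recovery case never touches the on-disk DB at all, so a broken local dir can't take the shuffle service down.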
> YARN shuffle service should use good disk from yarn.nodemanager.local-dirs
> --------------------------------------------------------------------------
>
> Key: SPARK-17321
> URL: https://issues.apache.org/jira/browse/SPARK-17321
> Project: Spark
> Issue Type: Bug
> Components: YARN
> Affects Versions: 1.6.2, 2.0.0
> Reporter: yunjiong zhao
>
> We run Spark on YARN. After enabling Spark dynamic allocation, we noticed
> some Spark applications failing randomly due to YarnShuffleService.
> From the logs I found:
> {quote}
> 2016-08-29 11:33:03,450 ERROR org.apache.spark.network.TransportContext: Error while initializing Netty pipeline
> java.lang.NullPointerException
> at org.apache.spark.network.server.TransportRequestHandler.<init>(TransportRequestHandler.java:77)
> at org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159)
> at org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135)
> at org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123)
> at org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116)
> at io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69)
> at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRegistered(AbstractChannelHandlerContext.java:133)
> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRegistered(AbstractChannelHandlerContext.java:119)
> at io.netty.channel.DefaultChannelPipeline.fireChannelRegistered(DefaultChannelPipeline.java:733)
> at io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:450)
> at io.netty.channel.AbstractChannel$AbstractUnsafe.access$100(AbstractChannel.java:378)
> at io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:424)
> at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
> at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
> at java.lang.Thread.run(Thread.java:745)
> {quote}
> This was caused by the first disk in yarn.nodemanager.local-dirs being broken.
> If we enabled spark.yarn.shuffle.stopOnFailure (SPARK-16505) we might lose
> hundreds of nodes, which is unacceptable.
> We have 12 disks in yarn.nodemanager.local-dirs, so why not use another good
> disk if the first one is broken? (A sketch of such a fallback follows below.)
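> As a minimal, hypothetical sketch of that fallback, one could walk the
> configured local dirs in order instead of unconditionally taking the first
> entry. The helper name and the usability checks here are assumptions, not
> Spark's actual code:
> {code:java}
> import java.io.File;
>
> public class LocalDirFallbackSketch {
>   // Hypothetical helper: scan yarn.nodemanager.local-dirs in order and
>   // return the first directory that is actually usable.
>   static File firstUsableDir(String[] localDirs) {
>     for (String dir : localDirs) {
>       File candidate = new File(dir);
>       if ((candidate.isDirectory() || candidate.mkdirs())
>           && candidate.canRead() && candidate.canWrite()) {
>         return candidate;
>       }
>     }
>     return null; // every configured disk is bad; the caller must handle this
>   }
> }
> {code}
> With 12 configured disks, a single bad first disk would then just be skipped
> rather than failing the whole shuffle service.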