[
https://issues.apache.org/jira/browse/SPARK-17321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15484748#comment-15484748
]
Thomas Graves commented on SPARK-17321:
---------------------------------------
I'm not sure I follow this comment. So you are using NM recovery with the
recovery path specified?
And you saw an error in the Spark shuffle service when creating or writing to
the DB, but the NM stayed up and kept writing its recovery data to the same disk?
> YARN shuffle service should use good disk from yarn.nodemanager.local-dirs
> --------------------------------------------------------------------------
>
> Key: SPARK-17321
> URL: https://issues.apache.org/jira/browse/SPARK-17321
> Project: Spark
> Issue Type: Bug
> Components: YARN
> Affects Versions: 1.6.2, 2.0.0
> Reporter: yunjiong zhao
>
> We run Spark on YARN. After enabling Spark dynamic allocation, we noticed
> that some Spark applications failed randomly due to YarnShuffleService.
> In the log I found:
> {quote}
> 2016-08-29 11:33:03,450 ERROR org.apache.spark.network.TransportContext: Error while initializing Netty pipeline
> java.lang.NullPointerException
> at org.apache.spark.network.server.TransportRequestHandler.<init>(TransportRequestHandler.java:77)
> at org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159)
> at org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135)
> at org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123)
> at org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116)
> at io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69)
> at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRegistered(AbstractChannelHandlerContext.java:133)
> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRegistered(AbstractChannelHandlerContext.java:119)
> at io.netty.channel.DefaultChannelPipeline.fireChannelRegistered(DefaultChannelPipeline.java:733)
> at io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:450)
> at io.netty.channel.AbstractChannel$AbstractUnsafe.access$100(AbstractChannel.java:378)
> at io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:424)
> at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
> at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
> at java.lang.Thread.run(Thread.java:745)
> {quote}
> This was caused by the first disk in yarn.nodemanager.local-dirs being broken.
> If we enabled spark.yarn.shuffle.stopOnFailure (SPARK-16505), we might lose
> hundreds of nodes, which is unacceptable.
> We have 12 disks in yarn.nodemanager.local-dirs, so why not use one of the
> other good disks if the first one is broken?
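
To make the request in the description concrete, here is a minimal sketch of "use another good disk if the first one is broken". It is not the YarnShuffleService code and not a proposed patch: the class GoodDiskPicker and its methods are hypothetical names, and it assumes the configured yarn.nodemanager.local-dirs value has already been split into a list of paths. A directory is treated as good only if a small probe file can actually be created and deleted in it.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Optional;

// Hypothetical helper for illustration only; not part of Spark or YARN.
public class GoodDiskPicker {

    /**
     * Returns the first directory, in configured order, in which a probe
     * file can be created and deleted, or empty if none qualifies.
     */
    public static Optional<Path> firstUsableDir(List<String> localDirs) {
        for (String dir : localDirs) {
            Path candidate = Paths.get(dir);
            if (isUsable(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    // Attempt a real write: permission bits alone can still look fine
    // on a disk that has failed.
    static boolean isUsable(Path dir) {
        if (!Files.isDirectory(dir)) {
            return false;
        }
        try {
            Path probe = Files.createTempFile(dir, "probe", ".tmp");
            Files.delete(probe);
            return true;
        } catch (IOException | SecurityException e) {
            return false;
        }
    }
}
```

With 12 entries in the list, a broken first disk would simply be skipped and the second entry tried. This only covers choosing a directory at startup; what should happen to an existing recovery DB on a disk that later fails is a separate question that the sketch does not address.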