[
https://issues.apache.org/jira/browse/SPARK-24641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16521613#comment-16521613
]
Igor Berman commented on SPARK-24641:
-------------------------------------
One more comment. I'm wondering whether the initial design intended to
differentiate between
1. an error registering the driver in the callback
(https://github.com/apache/spark/blob/a5849ad9a3e5d41b5938faa7c592bcc6aec36044/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/mesos/MesosExternalShuffleClient.java#L98),
which does not seem critical at all, and
2. an error in client creation
(https://github.com/apache/spark/blob/a5849ad9a3e5d41b5938faa7c592bcc6aec36044/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/mesos/MesosExternalShuffleClient.java#L76).
The first error produces only a warning, while the latter kind of aborts the
driver (not really, see above): the exception is propagated to the higher level.
But the effect of failing to register the driver with some external shuffle
service is exactly the same: Spark won't be able to remove shuffle files.
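
To make the point concrete, here is a minimal sketch of what treating both
failures the same way could look like. It is not the code in
MesosExternalShuffleClient: the class, the Connector/Connection interfaces and
the Result type are all hypothetical names invented for illustration. The only
idea taken from the comment above is that a failed connection and a failed
registration both leave the driver unregistered, so both are reported through
one return value and the caller applies a single policy.

```java
import java.io.IOException;

// Hypothetical sketch: none of these types exist in Spark.
final class ShuffleRegistrationSketch {

    /** Stand-in for the step that creates a client for the shuffle service. */
    interface Connector {
        Connection connect(String host, int port) throws IOException;
    }

    /** Stand-in for an open connection on which the driver registers itself. */
    interface Connection {
        void register(String appId) throws IOException;
    }

    /** Outcome of one attempt; both failure kinds are reported the same way. */
    static final class Result {
        final String host;
        final boolean registered;
        final String failedStep; // "connect", "register", or null on success

        Result(String host, boolean registered, String failedStep) {
            this.host = host;
            this.registered = registered;
            this.failedStep = failedStep;
        }
    }

    /** Never throws: the caller decides what an unregistered driver means. */
    static Result registerDriver(Connector connector, String host, int port, String appId) {
        Connection connection;
        try {
            connection = connector.connect(host, port);
        } catch (IOException e) {
            return new Result(host, false, "connect");
        }
        try {
            connection.register(appId);
        } catch (IOException e) {
            return new Result(host, false, "register");
        }
        return new Result(host, true, null);
    }

    public static void main(String[] args) {
        // A connector that always refuses, standing in for an unreachable service.
        Connector refusing = (host, port) -> {
            throw new IOException("Connection refused: " + host + ":" + port);
        };
        Result result = registerDriver(refusing, "my-company.com", 7337, "app-1");
        System.out.println(result.host + " registered=" + result.registered
                + " failedStep=" + result.failedStep);
    }
}
```

With a shape like this the caller could log a warning, retry, or give up on the
executor, but it would do the same thing for both failure kinds.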
> Spark-Mesos integration doesn't respect request to abort itself
> ---------------------------------------------------------------
>
> Key: SPARK-24641
> URL: https://issues.apache.org/jira/browse/SPARK-24641
> Project: Spark
> Issue Type: Bug
> Components: Mesos, Shuffle
> Affects Versions: 2.2.0
> Reporter: Igor Berman
> Priority: Major
>
> Hi,
> lately we came across the following corner-case scenario:
> We are using dynamic allocation with an external shuffle service that is
> managed by Marathon.
>
> Due to some network/operational issue, the external shuffle service on one of
> the machines (mesos-slaves) is not available for a few seconds (e.g. Marathon
> hasn't yet provisioned the external shuffle service on a particular node, but
> the framework itself has already accepted an offer on this node and tries to
> start an executor).
>
> This causes the framework (Spark driver) to fail, and I see an error in the
> driver's stderr (it seems the mesos-agent asks the driver to abort itself).
> However, the Spark context continues to run (seemingly in a kind of zombie
> mode, since it can't release resources to the cluster and can't get additional
> offers, because the framework is aborted from the Mesos perspective).
>
> The framework in the Mesos UI moves to the "inactive" state.
> [~skonto] [~susanxhuynh] any input on this problem? Have you come across such
> behavior?
> I'm ready to work on a patch, but currently I don't understand where to
> start. It seems the driver is too fragile in this sense and something in the
> Mesos-Spark integration is missing.
>
>
> {code:java}
> I0412 07:31:25.827283 274 sched.cpp:759] Framework registered with 15d9838f-b266-413b-842d-f7c3567bd04a-0051
> Exception in thread "Thread-295" java.io.IOException: Failed to connect to my-company.com/10.106.14.61:7337
>     at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
>     at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
>     at org.apache.spark.network.shuffle.mesos.MesosExternalShuffleClient.registerDriverWithShuffleService(MesosExternalShuffleClient.java:75)
>     at org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedSchedulerBackend.statusUpdate(MesosCoarseGrainedSchedulerBackend.scala:537)
> Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: my-company.com/10.106.14.61:7337
>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>     at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257)
>     at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291)
>     at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:631)
>     at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
>     at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
>     at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
>     at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
>     at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
>     at java.lang.Thread.run(Thread.java:748)
> I0412 07:35:12.032925 277 sched.cpp:2055] Asked to abort the driver
> I0412 07:35:12.033035 277 sched.cpp:1233] Aborting framework 15d9838f-b266-413b-842d-f7c3567bd04a-0051
> {code}
>
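
The log above ends with the framework being aborted while, per the description,
the Spark context keeps running. As a rough illustration of what "respecting
the abort" could mean, the sketch below stops the application exactly once when
an abort notification arrives. Everything in it is an assumption made for
illustration: AbortHandlingSketch, onFrameworkAborted and the stopApplication
hook are hypothetical names, not part of Spark or the Mesos bindings, and the
sketch does not claim to show where such a hook would attach in the real
scheduler backend.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: not Spark or Mesos code.
final class AbortHandlingSketch {

    private final AtomicBoolean stopped = new AtomicBoolean(false);
    private final Runnable stopApplication;

    /** stopApplication stands in for whatever shuts the application down. */
    AbortHandlingSketch(Runnable stopApplication) {
        this.stopApplication = stopApplication;
    }

    /** Assumed to be called when the cluster manager aborts the framework. */
    void onFrameworkAborted(String reason) {
        // compareAndSet ensures the stop hook runs once even if the
        // notification is delivered repeatedly or from several threads.
        if (stopped.compareAndSet(false, true)) {
            System.err.println("Framework aborted, stopping application: " + reason);
            stopApplication.run();
        }
    }

    boolean isStopped() {
        return stopped.get();
    }

    public static void main(String[] args) {
        AbortHandlingSketch handler =
                new AbortHandlingSketch(() -> System.err.println("application stopped"));
        handler.onFrameworkAborted("asked to abort the driver");
        handler.onFrameworkAborted("duplicate notification, ignored");
    }
}
```

The design choice is only that an aborted framework should not leave the
application running; whether to stop, restart, or re-register is a separate
decision that this sketch does not address.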