Igor Berman created SPARK-24641:
-----------------------------------

             Summary: Spark-Mesos integration doesn't respect request to abort itself
                 Key: SPARK-24641
                 URL: https://issues.apache.org/jira/browse/SPARK-24641
             Project: Spark
          Issue Type: Bug
          Components: Mesos, Shuffle
    Affects Versions: 2.2.0
            Reporter: Igor Berman


Hi,
we recently came across the following corner case:
We are using dynamic allocation with an external shuffle service that is managed 
by Marathon.
 
Due to some network/operational issue, the external shuffle service on one of the 
machines (mesos-slaves) is unavailable for a few seconds (e.g. Marathon hasn't 
yet provisioned the external shuffle service on a particular node, but the 
framework itself has already accepted an offer on this node and tries to start 
an executor).
 
This makes the framework (Spark driver) fail, and I see an error in the driver's 
stderr (it seems the mesos-agent asks the driver to abort itself). However, the 
Spark context continues to run (seemingly in a kind of zombie mode, since it 
can't release resources to the cluster and can't get additional offers, because 
the framework is aborted from the Mesos perspective).
 
The framework in the Mesos UI moves to the "inactive" state.

[~skonto] [~susanxhuynh] any input on this problem? Have you come across such 
behavior?


I'm ready to work on a patch, but currently I don't understand where to start. 
It seems the driver is too fragile in this respect and something in the 
Mesos-Spark integration is missing.
 
 
{code:java}
I0412 07:31:25.827283   274 sched.cpp:759] Framework registered with 15d9838f-b266-413b-842d-f7c3567bd04a-0051
Exception in thread "Thread-295" java.io.IOException: Failed to connect to my-company.com/10.106.14.61:7337
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
        at org.apache.spark.network.shuffle.mesos.MesosExternalShuffleClient.registerDriverWithShuffleService(MesosExternalShuffleClient.java:75)
        at org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedSchedulerBackend.statusUpdate(MesosCoarseGrainedSchedulerBackend.scala:537)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: my-company.com/10.106.14.61:7337
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:631)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
        at java.lang.Thread.run(Thread.java:748)
I0412 07:35:12.032925   277 sched.cpp:2055] Asked to abort the driver
I0412 07:35:12.033035   277 sched.cpp:1233] Aborting framework 15d9838f-b266-413b-842d-f7c3567bd04a-0051
{code}
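
One direction a patch could take, as a rough sketch only and not a proposed implementation: in the trace above the IOException from registerDriverWithShuffleService is not caught in statusUpdate, so a bounded retry with backoff around the registration call might ride out a shuffle service that is down for a few seconds. The class ShuffleRegistrationRetry, the IoAction interface and the attempt/delay parameters are hypothetical names made up for illustration; none of them exist in Spark or Mesos.

```java
import java.io.IOException;

/** Hypothetical helper (not part of Spark): bounded retry for an I/O action. */
public class ShuffleRegistrationRetry {

    /** An action that may fail with an IOException, e.g. a connect attempt. */
    @FunctionalInterface
    public interface IoAction {
        void run() throws IOException;
    }

    /**
     * Runs the action, retrying on IOException with exponential backoff.
     * Returns true on success and false if all attempts failed or the
     * thread was interrupted; the IOException never escapes to the caller.
     */
    public static boolean tryWithBackoff(IoAction action, int maxAttempts, long initialDelayMs) {
        long delayMs = initialDelayMs;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                action.run();
                return true;
            } catch (IOException e) {
                if (attempt == maxAttempts) {
                    return false; // out of attempts, let the caller decide what to do
                }
                try {
                    Thread.sleep(delayMs);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt(); // preserve interrupt status
                    return false;
                }
                delayMs *= 2; // double the wait before the next attempt
            }
        }
        return false; // only reached when maxAttempts <= 0
    }
}
```

A caller would wrap the registration call in tryWithBackoff and, when it returns false, decide explicitly what to do (for example give up on that executor or stop the context) instead of letting the exception escape the callback. Sleeping inside a scheduler callback is itself a trade-off, so the delays would have to stay short.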
 


