[
https://issues.apache.org/jira/browse/SPARK-12826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15099079#comment-15099079
]
Alan Braithwaite commented on SPARK-12826:
------------------------------------------
Yes, we set it to listen on the any address (0.0.0.0) because we're scheduling the Spark
master with Marathon and can't know the address ahead of time.
Furthermore, we can't listen on the same address that is being advertised by
the proxy.
> Spark Workers do not attempt reconnect or exit on connection failure.
> ---------------------------------------------------------------------
>
> Key: SPARK-12826
> URL: https://issues.apache.org/jira/browse/SPARK-12826
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.6.0
> Reporter: Alan Braithwaite
>
> Spark 1.6.0, Hadoop 2.6.0 (CDH 5.4.2).
> We're running behind a TCP proxy (in the example, 10.14.12.11:7077 is the
> proxy's listen address, upstreaming to the Spark master, which listens on
> port 9682 at a different IP).
> To reproduce, I started a Spark worker, let it successfully connect to the
> master through the proxy, then tcpkill'd the connection on the worker.
> Nothing is logged from the code that handles reconnection attempts.
> {code}
> 16/01/14 18:23:30 INFO Worker: Connecting to master spark-master.example.com:7077...
> 16/01/14 18:23:30 DEBUG TransportClientFactory: Creating new connection to spark-master.example.com/10.14.12.11:7077
> 16/01/14 18:23:30 DEBUG TransportClientFactory: Connection to spark-master.example.com/10.14.12.11:7077 successful, running bootstraps...
> 16/01/14 18:23:30 DEBUG TransportClientFactory: Successfully created connection to spark-master.example.com/10.14.12.11:7077 after 1 ms (0 ms spent in bootstraps)
> 16/01/14 18:23:30 DEBUG Recycler: -Dio.netty.recycler.maxCapacity.default: 262144
> 16/01/14 18:23:30 INFO Worker: Successfully registered with master spark://0.0.0.0:9682
> 16/01/14 18:23:30 INFO Worker: Worker cleanup enabled; old application directories will be deleted in: /var/lib/spark/work
> 16/01/14 18:36:52 DEBUG SecurityManager: user=null aclsEnabled=false viewAcls=spark
> 16/01/14 18:36:52 DEBUG SecurityManager: user=null aclsEnabled=false viewAcls=spark
> 16/01/14 18:36:57 DEBUG SecurityManager: user=null aclsEnabled=false viewAcls=spark
> 16/01/14 18:36:57 DEBUG SecurityManager: user=null aclsEnabled=false viewAcls=spark
> 16/01/14 18:41:31 WARN TransportChannelHandler: Exception in connection from spark-master.example.com/10.14.12.11:7077
> java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
>         at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
>         at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
>         at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
>         at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>         at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>         at java.lang.Thread.run(Thread.java:745)
> -- nothing more is logged, going on 15 minutes --
> $ ag -C5 Disconn core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
> 313      registrationRetryTimer.foreach(_.cancel(true))
> 314      registrationRetryTimer = None
> 315    }
> 316
> 317    private def registerWithMaster() {
> 318      // onDisconnected may be triggered multiple times, so don't attempt registration
> 319      // if there are outstanding registration attempts scheduled.
> 320      registrationRetryTimer match {
> 321        case None =>
> 322          registered = false
> 323          registerMasterFutures = tryRegisterAllMasters()
> --
> 549        finishedExecutors.values.toList, drivers.values.toList,
> 550        finishedDrivers.values.toList, activeMasterUrl, cores, memory,
> 551        coresUsed, memoryUsed, activeMasterWebUiUrl))
> 552    }
> 553
> 554    override def onDisconnected(remoteAddress: RpcAddress): Unit = {
> 555      if (master.exists(_.address == remoteAddress)) {
> 556        logInfo(s"$remoteAddress Disassociated !")
> 557        masterDisconnected()
> 558      }
> 559    }
> 560
> 561    private def masterDisconnected() {
> 562      logError("Connection to master failed! Waiting for master to reconnect...")
> 563      connected = false
> 564      registerWithMaster()
> 565    }
> 566
> {code}
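> For what it's worth, one possible reading of the excerpt above (my own speculation,
> not confirmed anywhere in this ticket): the worker registered with the master's
> advertised address (spark://0.0.0.0:9682), but the disconnect event arrives from the
> proxy address (spark-master.example.com/10.14.12.11:7077), so the
> {{master.exists(_.address == remoteAddress)}} guard in onDisconnected never matches
> and masterDisconnected() is never called. A minimal Python model of that guard
> (a sketch only; the RpcAddress class and on_disconnected function are illustrative
> stand-ins, not Spark code):
> {code}
# Sketch of the onDisconnected address guard, using the two addresses
# that appear in the log above. RpcAddress here is a hypothetical
# stand-in for Spark's RpcAddress, reduced to host/port equality.

class RpcAddress:
    def __init__(self, host, port):
        self.host, self.port = host, port

    def __eq__(self, other):
        return (self.host, self.port) == (other.host, other.port)

# Address the worker recorded at registration time (from the log:
# "Successfully registered with master spark://0.0.0.0:9682").
registered_master = RpcAddress("0.0.0.0", 9682)

# Address the TCP connection (and thus the disconnect event) comes from
# (from the log: "Exception in connection from .../10.14.12.11:7077").
remote_address = RpcAddress("spark-master.example.com", 7077)

def on_disconnected(remote):
    """Mirrors `if (master.exists(_.address == remoteAddress))`."""
    if registered_master == remote:
        return "masterDisconnected() called"
    return "disconnect silently ignored"

print(on_disconnected(remote_address))
> {code}
> Under this reading the addresses never compare equal behind a proxy, so the
> reconnect path is skipped without logging anything, which would match the
> observed silence.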
> Please let me know if there is any other information I can provide.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]