[jira] [Updated] (SPARK-11228) Job stuck in Executor failure loop when NettyTransport failed to bind

Hyukjin Kwon (JIRA) Mon, 20 May 2019 21:48:05 -0700


     [ https://issues.apache.org/jira/browse/SPARK-11228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Hyukjin Kwon updated SPARK-11228:
---------------------------------
    Labels: bulk-closed  (was: )

> Job stuck in Executor failure loop when NettyTransport failed to bind
> ---------------------------------------------------------------------
>
>                 Key: SPARK-11228
>                 URL: https://issues.apache.org/jira/browse/SPARK-11228
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 1.5.1
>         Environment: 14.04.1-Ubuntu SMP x86_64 GNU/Linux
>            Reporter: Romi Kuntsman
>            Priority: Major
>              Labels: bulk-closed
>
> I changed my network connection while a local Spark cluster was running. The 
> web UI on port 8080 still shows the master and worker running. 
> I'm running Spark from Java in client mode, so the driver runs inside my 
> IDE. When I try to start a job on the local Spark cluster, the driver logs an 
> endless loop of the errors listed under #1 below.
> The loop only stops when I kill the application manually.
> The worker log shows an endless loop of the errors listed under #2 below.
> Expected behaviour: the job should fail after a few failed retries or a 
> timeout.
> (IP anonymized to 1.2.3.4)
> 1. Errors seen on driver:
> 2015-10-21 11:20:54,793 INFO  [org.apache.spark.scheduler.TaskSchedulerImpl] 
> Adding task set 0.0 with 2 tasks
> 2015-10-21 11:20:55,847 INFO  
> [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: 
> app-20151021112052-0005/1 is now EXITED (Command exited with code 1)
> 2015-10-21 11:20:55,847 INFO  
> [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Executor 
> app-20151021112052-0005/1 removed: Command exited with code 1
> 2015-10-21 11:20:55,848 INFO  
> [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Asked to 
> remove non-existent executor 1
> 2015-10-21 11:20:55,848 INFO  
> [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor added: 
> app-20151021112052-0005/2 on worker-20151021090623-1.2.3.4-57305 
> (1.2.3.4:57305) with 1 cores
> 2015-10-21 11:20:55,848 INFO  
> [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Granted 
> executor ID app-20151021112052-0005/2 on hostPort 1.2.3.4:57305 with 1 cores, 
> 4.9 GB RAM
> 2015-10-21 11:20:55,849 INFO  
> [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: 
> app-20151021112052-0005/2 is now LOADING
> 2015-10-21 11:20:55,852 INFO  
> [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: 
> app-20151021112052-0005/2 is now RUNNING
> 2015-10-21 11:20:57,165 INFO  
> [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: 
> app-20151021112052-0005/2 is now EXITED (Command exited with code 1)
> 2015-10-21 11:20:57,165 INFO  
> [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Executor 
> app-20151021112052-0005/2 removed: Command exited with code 1
> 2015-10-21 11:20:57,166 INFO  
> [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Asked to 
> remove non-existent executor 2
> 2015-10-21 11:20:57,166 INFO  
> [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor added: 
> app-20151021112052-0005/3 on worker-20151021090623-1.2.3.4-57305 
> (1.2.3.4:57305) with 1 cores
> 2015-10-21 11:20:57,167 INFO  
> [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Granted 
> executor ID app-20151021112052-0005/3 on hostPort 1.2.3.4:57305 with 1 cores, 
> 4.9 GB RAM
> 2015-10-21 11:20:57,167 INFO  
> [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: 
> app-20151021112052-0005/3 is now LOADING
> 2015-10-21 11:20:57,169 INFO  
> [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: 
> app-20151021112052-0005/3 is now RUNNING
> 2015-10-21 11:20:58,531 INFO  
> [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: 
> app-20151021112052-0005/3 is now EXITED (Command exited with code 1)
> 2015-10-21 11:20:58,531 INFO  
> [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Executor 
> app-20151021112052-0005/3 removed: Command exited with code 1
> 2015-10-21 11:20:58,532 INFO  
> [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Asked to 
> remove non-existent executor 3
> 2015-10-21 11:20:58,532 INFO  
> [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor added: 
> app-20151021112052-0005/4 on worker-20151021090623-1.2.3.4-57305 
> (1.2.3.4:57305) with 1 cores
> 2015-10-21 11:20:58,532 INFO  
> [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Granted 
> executor ID app-20151021112052-0005/4 on hostPort 1.2.3.4:57305 with 1 cores, 
> 4.9 GB RAM
> 2015-10-21 11:20:58,533 INFO  
> [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: 
> app-20151021112052-0005/4 is now LOADING
> 2015-10-21 11:20:58,535 INFO  
> [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: 
> app-20151021112052-0005/4 is now RUNNING
> 2015-10-21 11:20:59,932 INFO  
> [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: 
> app-20151021112052-0005/4 is now EXITED (Command exited with code 1)
> 2015-10-21 11:20:59,933 INFO  
> [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Executor 
> app-20151021112052-0005/4 removed: Command exited with code 1
> 2015-10-21 11:20:59,933 INFO  
> [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Asked to 
> remove non-existent executor 4
> 2015-10-21 11:20:59,933 INFO  
> [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor added: 
> app-20151021112052-0005/5 on worker-20151021090623-1.2.3.4-57305 
> (1.2.3.4:57305) with 1 cores
> 2015-10-21 11:20:59,934 INFO  
> [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Granted 
> executor ID app-20151021112052-0005/5 on hostPort 1.2.3.4:57305 with 1 cores, 
> 4.9 GB RAM
> 2015-10-21 11:20:59,935 INFO  
> [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: 
> app-20151021112052-0005/5 is now LOADING
> 2015-10-21 11:20:59,937 INFO  
> [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: 
> app-20151021112052-0005/5 is now RUNNING
> 2015-10-21 11:21:01,338 INFO  
> [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: 
> app-20151021112052-0005/5 is now EXITED (Command exited with code 1)
> 2015-10-21 11:21:01,338 INFO  
> [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Executor 
> app-20151021112052-0005/5 removed: Command exited with code 1
> 2015-10-21 11:21:01,339 INFO  
> [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Asked to 
> remove non-existent executor 5
> 2. Errors seen on workers:
> 15/10/21 11:20:53 INFO Remoting: Starting remoting
> 15/10/21 11:20:53 ERROR NettyTransport: failed to bind to /1.2.3.4:0, 
> shutting down Netty transport
> 15/10/21 11:20:53 WARN Utils: Service 'driverPropsFetcher' could not bind on 
> port 0. Attempting port 1.
> 15/10/21 11:20:53 INFO Slf4jLogger: Slf4jLogger started
> 15/10/21 11:20:53 INFO RemoteActorRefProvider$RemotingTerminator: Shutting 
> down remote daemon.
> 15/10/21 11:20:53 INFO Remoting: Starting remoting
> 15/10/21 11:20:53 INFO RemoteActorRefProvider$RemotingTerminator: Remote 
> daemon shut down; proceeding with flushing remote transports.
> 15/10/21 11:20:53 ERROR Remoting: Remoting system has been terminated 
> abrubtly. Attempting to shut down transports
> 15/10/21 11:20:53 INFO RemoteActorRefProvider$RemotingTerminator: Remoting 
> shut down.
> 15/10/21 11:20:53 ERROR NettyTransport: failed to bind to /1.2.3.4:0, 
> shutting down Netty transport
> 15/10/21 11:20:53 WARN Utils: Service 'driverPropsFetcher' could not bind on 
> port 0. Attempting port 1.
> 15/10/21 11:20:53 INFO RemoteActorRefProvider$RemotingTerminator: Shutting 
> down remote daemon.
> 15/10/21 11:20:53 INFO RemoteActorRefProvider$RemotingTerminator: Remote 
> daemon shut down; proceeding with flushing remote transports.
> 15/10/21 11:20:53 INFO Slf4jLogger: Slf4jLogger started
> 15/10/21 11:20:53 ERROR Remoting: Remoting system has been terminated 
> abrubtly. Attempting to shut down transports
> 15/10/21 11:20:53 INFO RemoteActorRefProvider$RemotingTerminator: Remoting 
> shut down.
> 15/10/21 11:20:53 INFO Remoting: Starting remoting
> 15/10/21 11:20:54 ERROR NettyTransport: failed to bind to /1.2.3.4:0, 
> shutting down Netty transport
> 15/10/21 11:20:54 WARN Utils: Service 'driverPropsFetcher' could not bind on 
> port 0. Attempting port 1.
> 15/10/21 11:20:54 INFO RemoteActorRefProvider$RemotingTerminator: Shutting 
> down remote daemon.
> 15/10/21 11:20:54 INFO RemoteActorRefProvider$RemotingTerminator: Remote 
> daemon shut down; proceeding with flushing remote transports.
> 15/10/21 11:20:54 INFO Slf4jLogger: Slf4jLogger started
> 15/10/21 11:20:54 INFO Remoting: Starting remoting
> 15/10/21 11:20:54 ERROR NettyTransport: failed to bind to /1.2.3.4:0, 
> shutting down Netty transport
> 15/10/21 11:20:54 WARN Utils: Service 'driverPropsFetcher' could not bind on 
> port 0. Attempting port 1.
> 15/10/21 11:20:54 INFO RemoteActorRefProvider$RemotingTerminator: Shutting 
> down remote daemon.
> 15/10/21 11:20:54 INFO RemoteActorRefProvider$RemotingTerminator: Remote 
> daemon shut down; proceeding with flushing remote transports.
> 15/10/21 11:20:54 ERROR Remoting: Remoting system has been terminated 
> abrubtly. Attempting to shut down transports
> 15/10/21 11:20:54 INFO RemoteActorRefProvider$RemotingTerminator: Remoting 
> shut down.
> 15/10/21 11:20:54 INFO Slf4jLogger: Slf4jLogger started
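
The expected behaviour described in the report (fail the job after a few failed retries instead of relaunching executors forever) can be illustrated with a minimal sketch. This is not Spark code: the class BoundedExecutorRelauncher, its method names, and the limit of 3 are hypothetical, chosen only to show the idea of counting consecutive executor exits with a non-zero code and giving up once a limit is reached.

```java
// Hypothetical helper, NOT part of Spark: tracks consecutive executor
// failures and signals when the application should be failed instead of
// launching yet another executor.
final class BoundedExecutorRelauncher {
    private final int maxConsecutiveFailures; // assumed limit, e.g. 3
    private int consecutiveFailures = 0;

    BoundedExecutorRelauncher(int maxConsecutiveFailures) {
        if (maxConsecutiveFailures < 1) {
            throw new IllegalArgumentException("limit must be at least 1");
        }
        this.maxConsecutiveFailures = maxConsecutiveFailures;
    }

    /**
     * Records one executor exit.
     *
     * @param exitCode process exit code reported for the executor
     * @return true if the limit is reached and the job should be failed,
     *         false if another executor may be launched
     */
    boolean recordExit(int exitCode) {
        if (exitCode == 0) {
            // A clean exit resets the streak (an assumption of this sketch).
            consecutiveFailures = 0;
            return false;
        }
        consecutiveFailures++;
        return consecutiveFailures >= maxConsecutiveFailures;
    }

    int consecutiveFailures() {
        return consecutiveFailures;
    }
}
```

With a limit of 3, the driver log above would stop after executor 3 exits with code 1, rather than continuing to executors 4, 5, and beyond.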



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
