[
https://issues.apache.org/jira/browse/SPARK-11228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon updated SPARK-11228:
---------------------------------
Labels: bulk-closed (was: )
> Job stuck in Executor failure loop when NettyTransport failed to bind
> ---------------------------------------------------------------------
>
> Key: SPARK-11228
> URL: https://issues.apache.org/jira/browse/SPARK-11228
> Project: Spark
> Issue Type: Bug
> Components: Scheduler
> Affects Versions: 1.5.1
> Environment: 14.04.1-Ubuntu SMP x86_64 GNU/Linux
> Reporter: Romi Kuntsman
> Priority: Major
> Labels: bulk-closed
>
> I changed my network connection while a local Spark cluster was running. On
> port 8080, I can see the master and worker running.
> I'm running Spark from Java in client mode, so the driver runs inside my
> IDE. When I try to start a job on the local Spark cluster, I get an endless
> loop of the errors shown under #1 below.
> It only stops when I kill the application manually.
> In the worker log, I see an endless loop of the errors shown under #2
> below.
> The expected behaviour would be for the job to fail after a few failed
> retries or a timeout.
> (IP anonymized to 1.2.3.4)
> 1. Errors seen on driver:
> 2015-10-21 11:20:54,793 INFO [org.apache.spark.scheduler.TaskSchedulerImpl]
> Adding task set 0.0 with 2 tasks
> 2015-10-21 11:20:55,847 INFO
> [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated:
> app-20151021112052-0005/1 is now EXITED (Command exited with code 1)
> 2015-10-21 11:20:55,847 INFO
> [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Executor
> app-20151021112052-0005/1 removed: Command exited with code 1
> 2015-10-21 11:20:55,848 INFO
> [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Asked to
> remove non-existent executor 1
> 2015-10-21 11:20:55,848 INFO
> [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor added:
> app-20151021112052-0005/2 on worker-20151021090623-1.2.3.4-57305
> (1.2.3.4:57305) with 1 cores
> 2015-10-21 11:20:55,848 INFO
> [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Granted
> executor ID app-20151021112052-0005/2 on hostPort 1.2.3.4:57305 with 1 cores,
> 4.9 GB RAM
> 2015-10-21 11:20:55,849 INFO
> [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated:
> app-20151021112052-0005/2 is now LOADING
> 2015-10-21 11:20:55,852 INFO
> [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated:
> app-20151021112052-0005/2 is now RUNNING
> 2015-10-21 11:20:57,165 INFO
> [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated:
> app-20151021112052-0005/2 is now EXITED (Command exited with code 1)
> 2015-10-21 11:20:57,165 INFO
> [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Executor
> app-20151021112052-0005/2 removed: Command exited with code 1
> 2015-10-21 11:20:57,166 INFO
> [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Asked to
> remove non-existent executor 2
> 2015-10-21 11:20:57,166 INFO
> [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor added:
> app-20151021112052-0005/3 on worker-20151021090623-1.2.3.4-57305
> (1.2.3.4:57305) with 1 cores
> 2015-10-21 11:20:57,167 INFO
> [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Granted
> executor ID app-20151021112052-0005/3 on hostPort 1.2.3.4:57305 with 1 cores,
> 4.9 GB RAM
> 2015-10-21 11:20:57,167 INFO
> [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated:
> app-20151021112052-0005/3 is now LOADING
> 2015-10-21 11:20:57,169 INFO
> [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated:
> app-20151021112052-0005/3 is now RUNNING
> 2015-10-21 11:20:58,531 INFO
> [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated:
> app-20151021112052-0005/3 is now EXITED (Command exited with code 1)
> 2015-10-21 11:20:58,531 INFO
> [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Executor
> app-20151021112052-0005/3 removed: Command exited with code 1
> 2015-10-21 11:20:58,532 INFO
> [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Asked to
> remove non-existent executor 3
> 2015-10-21 11:20:58,532 INFO
> [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor added:
> app-20151021112052-0005/4 on worker-20151021090623-1.2.3.4-57305
> (1.2.3.4:57305) with 1 cores
> 2015-10-21 11:20:58,532 INFO
> [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Granted
> executor ID app-20151021112052-0005/4 on hostPort 1.2.3.4:57305 with 1 cores,
> 4.9 GB RAM
> 2015-10-21 11:20:58,533 INFO
> [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated:
> app-20151021112052-0005/4 is now LOADING
> 2015-10-21 11:20:58,535 INFO
> [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated:
> app-20151021112052-0005/4 is now RUNNING
> 2015-10-21 11:20:59,932 INFO
> [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated:
> app-20151021112052-0005/4 is now EXITED (Command exited with code 1)
> 2015-10-21 11:20:59,933 INFO
> [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Executor
> app-20151021112052-0005/4 removed: Command exited with code 1
> 2015-10-21 11:20:59,933 INFO
> [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Asked to
> remove non-existent executor 4
> 2015-10-21 11:20:59,933 INFO
> [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor added:
> app-20151021112052-0005/5 on worker-20151021090623-1.2.3.4-57305
> (1.2.3.4:57305) with 1 cores
> 2015-10-21 11:20:59,934 INFO
> [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Granted
> executor ID app-20151021112052-0005/5 on hostPort 1.2.3.4:57305 with 1 cores,
> 4.9 GB RAM
> 2015-10-21 11:20:59,935 INFO
> [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated:
> app-20151021112052-0005/5 is now LOADING
> 2015-10-21 11:20:59,937 INFO
> [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated:
> app-20151021112052-0005/5 is now RUNNING
> 2015-10-21 11:21:01,338 INFO
> [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated:
> app-20151021112052-0005/5 is now EXITED (Command exited with code 1)
> 2015-10-21 11:21:01,338 INFO
> [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Executor
> app-20151021112052-0005/5 removed: Command exited with code 1
> 2015-10-21 11:21:01,339 INFO
> [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Asked to
> remove non-existent executor 5
> 2. Errors seen on workers:
> 15/10/21 11:20:53 INFO Remoting: Starting remoting
> 15/10/21 11:20:53 ERROR NettyTransport: failed to bind to /1.2.3.4:0,
> shutting down Netty transport
> 15/10/21 11:20:53 WARN Utils: Service 'driverPropsFetcher' could not bind on
> port 0. Attempting port 1.
> 15/10/21 11:20:53 INFO Slf4jLogger: Slf4jLogger started
> 15/10/21 11:20:53 INFO RemoteActorRefProvider$RemotingTerminator: Shutting
> down remote daemon.
> 15/10/21 11:20:53 INFO Remoting: Starting remoting
> 15/10/21 11:20:53 INFO RemoteActorRefProvider$RemotingTerminator: Remote
> daemon shut down; proceeding with flushing remote transports.
> 15/10/21 11:20:53 ERROR Remoting: Remoting system has been terminated
> abrubtly. Attempting to shut down transports
> 15/10/21 11:20:53 INFO RemoteActorRefProvider$RemotingTerminator: Remoting
> shut down.
> 15/10/21 11:20:53 ERROR NettyTransport: failed to bind to /1.2.3.4:0,
> shutting down Netty transport
> 15/10/21 11:20:53 WARN Utils: Service 'driverPropsFetcher' could not bind on
> port 0. Attempting port 1.
> 15/10/21 11:20:53 INFO RemoteActorRefProvider$RemotingTerminator: Shutting
> down remote daemon.
> 15/10/21 11:20:53 INFO RemoteActorRefProvider$RemotingTerminator: Remote
> daemon shut down; proceeding with flushing remote transports.
> 15/10/21 11:20:53 INFO Slf4jLogger: Slf4jLogger started
> 15/10/21 11:20:53 ERROR Remoting: Remoting system has been terminated
> abrubtly. Attempting to shut down transports
> 15/10/21 11:20:53 INFO RemoteActorRefProvider$RemotingTerminator: Remoting
> shut down.
> 15/10/21 11:20:53 INFO Remoting: Starting remoting
> 15/10/21 11:20:54 ERROR NettyTransport: failed to bind to /1.2.3.4:0,
> shutting down Netty transport
> 15/10/21 11:20:54 WARN Utils: Service 'driverPropsFetcher' could not bind on
> port 0. Attempting port 1.
> 15/10/21 11:20:54 INFO RemoteActorRefProvider$RemotingTerminator: Shutting
> down remote daemon.
> 15/10/21 11:20:54 INFO RemoteActorRefProvider$RemotingTerminator: Remote
> daemon shut down; proceeding with flushing remote transports.
> 15/10/21 11:20:54 INFO Slf4jLogger: Slf4jLogger started
> 15/10/21 11:20:54 INFO Remoting: Starting remoting
> 15/10/21 11:20:54 ERROR NettyTransport: failed to bind to /1.2.3.4:0,
> shutting down Netty transport
> 15/10/21 11:20:54 WARN Utils: Service 'driverPropsFetcher' could not bind on
> port 0. Attempting port 1.
> 15/10/21 11:20:54 INFO RemoteActorRefProvider$RemotingTerminator: Shutting
> down remote daemon.
> 15/10/21 11:20:54 INFO RemoteActorRefProvider$RemotingTerminator: Remote
> daemon shut down; proceeding with flushing remote transports.
> 15/10/21 11:20:54 ERROR Remoting: Remoting system has been terminated
> abrubtly. Attempting to shut down transports
> 15/10/21 11:20:54 INFO RemoteActorRefProvider$RemotingTerminator: Remoting
> shut down.
> 15/10/21 11:20:54 INFO Slf4jLogger: Slf4jLogger started
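
The behaviour the reporter expected, failing the job after a few failed
retries instead of relaunching executors forever, can be sketched as a bounded
retry loop. This is a minimal illustration in Java, not Spark's scheduler code:
the class BoundedRetry, its run method, the Result type, and the limit of 3
attempts are all hypothetical names and values chosen for the example.

```java
import java.util.function.IntSupplier;

/** Hypothetical helper (not part of Spark): relaunch a command a bounded number of times. */
final class BoundedRetry {

    /** Outcome of the retry loop: whether any launch succeeded and how many were made. */
    static final class Result {
        final boolean succeeded;
        final int attempts;

        Result(boolean succeeded, int attempts) {
            this.succeeded = succeeded;
            this.attempts = attempts;
        }
    }

    /**
     * Calls launch until it returns exit code 0 or maxAttempts launches have been made.
     * The caller is expected to fail the job when the result is not successful.
     */
    static Result run(IntSupplier launch, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            int exitCode = launch.getAsInt();
            if (exitCode == 0) {
                return new Result(true, attempt);
            }
        }
        // Give up instead of looping forever.
        return new Result(false, maxAttempts);
    }

    public static void main(String[] args) {
        // Simulates the executor from the logs above, which always exits with code 1.
        Result r = run(() -> 1, 3);
        System.out.println(r.succeeded + " after " + r.attempts + " attempts");
    }
}
```

With a launcher that always exits with code 1, as in the driver log, the loop
stops after the configured number of attempts and reports failure, which is the
point at which the job could be aborted.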