[
https://issues.apache.org/jira/browse/SPARK-1499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
XeonZhao updated SPARK-1499:
----------------------------
Comment: was deleted
(was: How to solve this problem? I have encountered this issue.)
> Workers continuously produce failing executors
> ----------------------------------------------
>
> Key: SPARK-1499
> URL: https://issues.apache.org/jira/browse/SPARK-1499
> Project: Spark
> Issue Type: Bug
> Components: Deploy, Spark Core
> Affects Versions: 0.9.1, 1.0.0
> Reporter: Aaron Davidson
>
> If a node is in a bad state, such that newly started executors fail on
> startup or first use, the Standalone Cluster Worker will happily keep
> spawning new ones. A better behavior would be for a Worker to mark itself as
> dead once it has repeatedly produced failing executors, or else to prevent a
> driver from repeatedly re-registering executors from the same machine.
> Reported on mailing list:
> http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3ccal8t0bqjfgtf-vbzjq6yj7ckbl_9p9s0trvew2mvg6zbngx...@mail.gmail.com%3E
> Relevant logs:
> {noformat}
> 14/04/11 19:06:52 INFO client.AppClient$ClientActor: Executor updated:
> app-20140411190649-0008/4 is now FAILED (Command exited with code 53)
> 14/04/11 19:06:52 INFO cluster.SparkDeploySchedulerBackend: Executor
> app-20140411190649-0008/4 removed: Command exited with code 53
> 14/04/11 19:06:52 INFO cluster.SparkDeploySchedulerBackend: Executor 4
> disconnected, so removing it
> 14/04/11 19:06:52 ERROR scheduler.TaskSchedulerImpl: Lost an executor 4
> (already removed): Failed to create local directory (bad spark.local.dir?)
> 14/04/11 19:06:52 INFO client.AppClient$ClientActor: Executor added:
> app-20140411190649-0008/27 on
> worker-20140409212012-ip-172-31-19-11.us-west-1.compute.internal-58614
> (ip-172-31-19-11.us-west-1.compute.internal:58614) with 8 cores
> 14/04/11 19:06:52 INFO cluster.SparkDeploySchedulerBackend: Granted executor
> ID app-20140411190649-0008/27 on hostPort
> ip-172-31-19-11.us-west-1.compute.internal:58614 with 8 cores, 56.9 GB RAM
> 14/04/11 19:06:52 INFO client.AppClient$ClientActor: Executor updated:
> app-20140411190649-0008/27 is now RUNNING
> 14/04/11 19:06:52 INFO storage.BlockManagerMasterActor$BlockManagerInfo:
> Registering block manager ip-172-31-24-76.us-west-1.compute.internal:50256
> with 32.7 GB RAM
> 14/04/11 19:06:52 INFO metastore.HiveMetaStore: 0: get_table : db=default
> tbl=wikistats_pd
> 14/04/11 19:06:52 INFO HiveMetaStore.audit: ugi=root ip=unknown-ip-addr
> cmd=get_table : db=default tbl=wikistats_pd
> 14/04/11 19:06:53 DEBUG hive.log: DDL: struct wikistats_pd { string
> projectcode, string pagename, i32 pageviews, i32 bytes}
> 14/04/11 19:06:53 DEBUG lazy.LazySimpleSerDe:
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe initialized with:
> columnNames=[projectcode, pagename, pageviews, bytes] columnTypes=[string,
> string, int, int] separator=[[B@29a81175] nullstring=\N
> lastColumnTakesRest=false
> shark> 14/04/11 19:06:55 INFO cluster.SparkDeploySchedulerBackend: Registered
> executor:
> Actor[akka.tcp://[email protected]:45248/user/Executor#-1002203295]
> with ID 27
> show 14/04/11 19:06:56 INFO cluster.SparkDeploySchedulerBackend: Executor 27
> disconnected, so removing it
> 14/04/11 19:06:56 ERROR scheduler.TaskSchedulerImpl: Lost an executor 27
> (already removed): remote Akka client disassociated
> 14/04/11 19:06:56 INFO client.AppClient$ClientActor: Executor updated:
> app-20140411190649-0008/27 is now FAILED (Command exited with code 53)
> 14/04/11 19:06:56 INFO cluster.SparkDeploySchedulerBackend: Executor
> app-20140411190649-0008/27 removed: Command exited with code 53
> 14/04/11 19:06:56 INFO client.AppClient$ClientActor: Executor added:
> app-20140411190649-0008/28 on
> worker-20140409212012-ip-172-31-19-11.us-west-1.compute.internal-58614
> (ip-172-31-19-11.us-west-1.compute.internal:58614) with 8 cores
> 14/04/11 19:06:56 INFO cluster.SparkDeploySchedulerBackend: Granted executor
> ID app-20140411190649-0008/28 on hostPort
> ip-172-31-19-11.us-west-1.compute.internal:58614 with 8 cores, 56.9 GB RAM
> 14/04/11 19:06:56 INFO client.AppClient$ClientActor: Executor updated:
> app-20140411190649-0008/28 is now RUNNING
> tables;
> {noformat}
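
For illustration only, the first behavior proposed in the description (a Worker marking itself dead after a run of failing executors) might be sketched roughly as follows. This is not Spark code: the class name `ExecutorFailureTracker`, the `max_consecutive_failures` threshold, and the rule that exit code 0 counts as a healthy executor are all assumptions made for the example.

```python
class ExecutorFailureTracker:
    """Hypothetical per-worker bookkeeping; not part of Spark."""

    def __init__(self, max_consecutive_failures=5):
        # Assumed threshold: how many failures in a row the worker tolerates.
        self.max_consecutive_failures = max_consecutive_failures
        self.consecutive_failures = 0
        self.dead = False

    def record_exit(self, exit_code):
        """Record one executor exit and return True if the worker is now dead."""
        if exit_code == 0:
            # Assumption: a clean exit means the node is healthy, so reset the streak.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.max_consecutive_failures:
                # Once marked dead, the worker stays dead in this sketch.
                self.dead = True
        return self.dead

    def can_launch(self):
        """A dead worker should refuse to spawn further executors."""
        return not self.dead


# Replay the pattern from the logs above: every executor exits with code 53.
tracker = ExecutorFailureTracker(max_consecutive_failures=3)
for code in [53, 53, 53]:
    tracker.record_exit(code)
print(tracker.can_launch())  # prints False
```

Counting consecutive failures rather than total failures keeps a node that fails only occasionally in service, while a node that fails every launch, as in the logs above, stops after a bounded number of attempts.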
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)