[
https://issues.apache.org/jira/browse/SPARK-10145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Marcelo Vanzin resolved SPARK-10145.
------------------------------------
Resolution: Unresolved
I'm closing this since a lot in this area has changed since this bug was
reported. If issues like this still happen they're bound to look a lot
different, so we'd need new (and more complete) logs to debug them.
> Executor exits without useful messages when Spark runs in Spark Streaming
> ------------------------------------------------------------------------
>
> Key: SPARK-10145
> URL: https://issues.apache.org/jira/browse/SPARK-10145
> Project: Spark
> Issue Type: Bug
> Components: DStreams, YARN
> Environment: spark 1.3.1, hadoop 2.6.0, 6 nodes, each node has 32
> cores and 32g memory
> Reporter: Baogang Wang
> Priority: Critical
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> Each node is allocated 30g of memory by YARN.
> My application receives messages from Kafka via a direct stream; each
> application consists of 4 DStream windows.
> The Spark application is submitted with this command:
> spark-submit --class spark_security.safe.SafeSockPuppet --driver-memory 3g
> --executor-memory 3g --num-executors 3 --executor-cores 4 --name
> safeSparkDealerUser --master yarn --deploy-mode cluster
> spark_Security-1.0-SNAPSHOT.jar.nocalse
> hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/spark_properties/safedealer.properties
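> For reference on sizing: with --executor-memory 3g, YARN reserves the
> executor heap plus an off-heap overhead (spark.yarn.executor.memoryOverhead),
> which defaults to roughly max(384 MB, a small fraction of the executor
> memory); the exact fraction depends on the Spark version. A minimal sketch of
> that arithmetic, with the defaults assumed rather than read from this cluster:

```python
# Rough YARN container sizing for this job. The two constants are assumptions
# (Spark's documented defaults in this era), not values taken from the cluster.
OVERHEAD_MIN_MB = 384      # assumed floor for spark.yarn.executor.memoryOverhead
OVERHEAD_FRACTION = 0.10   # assumed fraction of executor memory

def container_mb(executor_mem_mb: int) -> int:
    """Executor heap plus the assumed off-heap overhead YARN adds on top."""
    overhead = max(OVERHEAD_MIN_MB, int(executor_mem_mb * OVERHEAD_FRACTION))
    return executor_mem_mb + overhead

# --executor-memory 3g, --num-executors 3
per_executor = container_mb(3 * 1024)   # 3072 + max(384, 307) = 3456 MB
total = 3 * per_executor
print(per_executor, total)
```

> So each 3g executor actually occupies about 3.4g of the 30g each node
> offers to YARN (purely illustrative; check the docs for your Spark version).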
> After about 1 hour, some executors exit. There are no further YARN logs
> after an executor exits, and no stack trace is printed when it exits.
> The YARN NodeManager log shows the following:
> 2015-08-17 17:25:41,550 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
> Start request for container_1439803298368_0005_01_000001 by user root
> 2015-08-17 17:25:41,551 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
> Creating a new application reference for app application_1439803298368_0005
> 2015-08-17 17:25:41,551 INFO
> org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=root
> IP=172.19.160.102 OPERATION=Start Container Request
> TARGET=ContainerManageImpl RESULT=SUCCESS
> APPID=application_1439803298368_0005
> CONTAINERID=container_1439803298368_0005_01_000001
> 2015-08-17 17:25:41,551 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
> Application application_1439803298368_0005 transitioned from NEW to INITING
> 2015-08-17 17:25:41,552 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
> Adding container_1439803298368_0005_01_000001 to application
> application_1439803298368_0005
> 2015-08-17 17:25:41,557 WARN
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl:
> rollingMonitorInterval is set as -1. The log rolling mornitoring interval is
> disabled. The logs will be aggregated after this application is finished.
> 2015-08-17 17:25:41,663 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
> Application application_1439803298368_0005 transitioned from INITING to
> RUNNING
> 2015-08-17 17:25:41,664 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
> Container container_1439803298368_0005_01_000001 transitioned from NEW to
> LOCALIZING
> 2015-08-17 17:25:41,664 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got
> event CONTAINER_INIT for appId application_1439803298368_0005
> 2015-08-17 17:25:41,664 INFO
> org.apache.spark.network.yarn.YarnShuffleService: Initializing container
> container_1439803298368_0005_01_000001
> 2015-08-17 17:25:41,665 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource:
> Resource
> hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/user/root/.sparkStaging/application_1439803298368_0005/spark-assembly-1.3.1-hadoop2.6.0.jar
> transitioned from INIT to DOWNLOADING
> 2015-08-17 17:25:41,665 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource:
> Resource
> hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/user/root/.sparkStaging/application_1439803298368_0005/spark_Security-1.0-SNAPSHOT.jar
> transitioned from INIT to DOWNLOADING
> 2015-08-17 17:25:41,665 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
> Created localizer for container_1439803298368_0005_01_000001
> 2015-08-17 17:25:41,668 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
> Writing credentials to the nmPrivate file
> /export/servers/hadoop2.6.0/tmp/nm-local-dir/nmPrivate/container_1439803298368_0005_01_000001.tokens.
> Credentials list:
> 2015-08-17 17:25:41,682 INFO
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor:
> Initializing user root
> 2015-08-17 17:25:41,686 INFO
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Copying
> from
> /export/servers/hadoop2.6.0/tmp/nm-local-dir/nmPrivate/container_1439803298368_0005_01_000001.tokens
> to
> /export/servers/hadoop2.6.0/tmp/nm-local-dir/usercache/root/appcache/application_1439803298368_0005/container_1439803298368_0005_01_000001.tokens
> 2015-08-17 17:25:41,686 INFO
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Localizer
> CWD set to
> /export/servers/hadoop2.6.0/tmp/nm-local-dir/usercache/root/appcache/application_1439803298368_0005
> =
> file:/export/servers/hadoop2.6.0/tmp/nm-local-dir/usercache/root/appcache/application_1439803298368_0005
> 2015-08-17 17:25:42,240 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource:
> Resource
> hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/user/root/.sparkStaging/application_1439803298368_0005/spark-assembly-1.3.1-hadoop2.6.0.jar(->/export/servers/hadoop2.6.0/tmp/nm-local-dir/usercache/root/filecache/14/spark-assembly-1.3.1-hadoop2.6.0.jar)
> transitioned from DOWNLOADING to LOCALIZED
> 2015-08-17 17:25:42,508 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource:
> Resource
> hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/user/root/.sparkStaging/application_1439803298368_0005/spark_Security-1.0-SNAPSHOT.jar(->/export/servers/hadoop2.6.0/tmp/nm-local-dir/usercache/root/filecache/15/spark_Security-1.0-SNAPSHOT.jar)
> transitioned from DOWNLOADING to LOCALIZED
> 2015-08-17 17:25:42,508 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
> Container container_1439803298368_0005_01_000001 transitioned from
> LOCALIZING to LOCALIZED
> 2015-08-17 17:25:42,548 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
> Container container_1439803298368_0005_01_000001 transitioned from LOCALIZED
> to RUNNING
> ................................................
> 2015-08-17 17:26:20,366 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
> Start request for container_1439803298368_0005_01_000003 by user root
> 2015-08-17 17:26:20,367 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
> Adding container_1439803298368_0005_01_000003 to application
> application_1439803298368_0005
> 2015-08-17 17:26:20,368 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
> Container container_1439803298368_0005_01_000003 transitioned from NEW to
> LOCALIZING
> 2015-08-17 17:26:20,368 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got
> event CONTAINER_INIT for appId application_1439803298368_0005
> 2015-08-17 17:26:20,368 INFO
> org.apache.spark.network.yarn.YarnShuffleService: Initializing container
> container_1439803298368_0005_01_000003
> 2015-08-17 17:26:20,369 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
> Container container_1439803298368_0005_01_000003 transitioned from
> LOCALIZING to LOCALIZED
> 2015-08-17 17:26:20,370 INFO
> org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=root
> IP=172.19.160.102 OPERATION=Start Container Request
> TARGET=ContainerManageImpl RESULT=SUCCESS
> APPID=application_1439803298368_0005
> CONTAINERID=container_1439803298368_0005_01_000003
> 2015-08-17 17:26:20,443 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
> Container container_1439803298368_0005_01_000003 transitioned from LOCALIZED
> to RUNNING
> 2015-08-17 17:26:20,443 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
> Neither virutal-memory nor physical-memory monitoring is needed. Not running
> the monitor-thread
> 2015-08-17 17:26:20,449 INFO
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor:
> launchContainer: [bash,
> /export/servers/hadoop2.6.0/tmp/nm-local-dir/usercache/root/appcache/application_1439803298368_0005/container_1439803298368_0005_01_000003/default_container_executor.sh]
> ..........................................
>
> 2015-08-18 01:50:30,297 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
> Container container_1439803298368_0005_01_000003 succeeded
> 2015-08-18 01:50:30,440 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
> Container container_1439803298368_0005_01_000003 transitioned from RUNNING
> to EXITED_WITH_SUCCESS
> 2015-08-18 01:50:30,465 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
> Cleaning up container container_1439803298368_0005_01_000003
> 2015-08-18 01:50:35,046 INFO
> org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=root
> OPERATION=Container Finished - Succeeded TARGET=ContainerImpl
> RESULT=SUCCESS APPID=application_1439803298368_0005
> CONTAINERID=container_1439803298368_0005_01_000003
> 2015-08-18 01:50:35,062 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
> Container container_1439803298368_0005_01_000003 transitioned from
> EXITED_WITH_SUCCESS to DONE
> 2015-08-18 01:50:35,065 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
> Removing container_1439803298368_0005_01_000003 from application
> application_1439803298368_0005
> 2015-08-18 01:50:35,070 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
> Neither virutal-memory nor physical-memory monitoring is needed. Not running
> the monitor-thread
> 2015-08-18 01:50:35,082 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl:
> Considering container container_1439803298368_0005_01_000003 for
> log-aggregation
> 2015-08-18 01:50:35,089 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got
> event CONTAINER_STOP for appId application_1439803298368_0005
> 2015-08-18 01:50:35,099 INFO
> org.apache.spark.network.yarn.YarnShuffleService: Stopping container
> container_1439803298368_0005_01_000003
> 2015-08-18 01:50:35,105 INFO
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting
> absolute path :
> /export/servers/hadoop2.6.0/tmp/nm-local-dir/usercache/root/appcache/application_1439803298368_0005/container_1439803298368_0005_01_000003
> 2015-08-18 01:50:47,601 WARN
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code
> from container container_1439803298368_0005_01_000001 is : 15
> 2015-08-18 01:50:48,401 WARN
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception
> from container-launch with container ID:
> container_1439803298368_0005_01_000001 and exit code: 15
> ExitCodeException exitCode=15:
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
> at org.apache.hadoop.util.Shell.run(Shell.java:455)
> at
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
> at
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> container_1439803298368_0005_01_000003 was started at 2015-08-17 17:26:20
> and ran normally, but at 2015-08-18 01:50:30 it transitioned to
> EXITED_WITH_SUCCESS, and in the end it received CONTAINER_STOP.
> container_1439803298368_0005_01_000001 was started at 2015-08-17 17:25:42
> and exited suddenly at 2015-08-18 01:50:48.
> So according to the NodeManager log, container_1439803298368_0005_01_000003
> transitioned from RUNNING to EXITED_WITH_SUCCESS.
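> The transitions and exit codes above can be pulled out of a NodeManager log
> mechanically. A minimal sketch (the patterns are ad hoc and assume one log
> record per line after unwrapping; this is not a YARN API):

```python
import re

# Ad-hoc patterns for the two NodeManager messages of interest.
TRANSITION = re.compile(
    r"Container (container_\S+) transitioned from (\w+) to\s*(\w+)")
EXIT_CODE = re.compile(
    r"Exit code from container (container_\S+) is : (\d+)")

def scan(lines):
    """Return per-container state transitions and exit codes."""
    transitions, exit_codes = {}, {}
    for line in lines:
        if (m := TRANSITION.search(line)):
            transitions.setdefault(m.group(1), []).append((m.group(2), m.group(3)))
        elif (m := EXIT_CODE.search(line)):
            exit_codes[m.group(1)] = int(m.group(2))
    return transitions, exit_codes

# Two records taken from the log above, joined onto single lines.
sample = [
    "Container container_1439803298368_0005_01_000003 transitioned from "
    "RUNNING to EXITED_WITH_SUCCESS",
    "Exit code from container container_1439803298368_0005_01_000001 is : 15",
]
trans, codes = scan(sample)
print(trans, codes)
```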
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]