[ https://issues.apache.org/jira/browse/SPARK-10145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706139#comment-14706139 ]
Baogang Wang commented on SPARK-10145:
--------------------------------------

The spark-defaults.conf is as follows (see also the two sketches at the end of this message):

# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.
# Example:
# spark.master                     spark://master:7077
# spark.eventLog.enabled           true
# spark.eventLog.dir               hdfs://namenode:8021/directory
spark.serializer                   org.apache.spark.serializer.KryoSerializer
# spark.driver.memory              5g
# spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
#spark.core.connection.ack.wait.timeout   3600
#spark.core.connection.auth.wait.timeout  3600
spark.akka.frameSize               1024
spark.driver.extraJavaOptions      -Dhdp.version=2.2.0.0-2041
spark.yarn.am.extraJavaOptions     -Dhdp.version=2.2.0.0-2041
spark.akka.timeout                 900
spark.storage.memoryFraction       0.4
spark.rdd.compress                 true
spark.shuffle.blockTransferService nio
spark.yarn.executor.memoryOverhead 1024

> Executor exit without useful messages when spark runs in spark-streaming
> ------------------------------------------------------------------------
>
>                 Key: SPARK-10145
>                 URL: https://issues.apache.org/jira/browse/SPARK-10145
>             Project: Spark
>          Issue Type: Bug
>          Components: Streaming, YARN
>        Environment: spark 1.3.1, hadoop 2.6.0, 6 nodes, each node has 32 cores and 32g memory
>           Reporter: Baogang Wang
>           Priority: Critical
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Each node is allocated 30g of memory by YARN.
> My application receives messages from Kafka via a direct stream. Each application consists of 4 DStream windows.
> The Spark application is submitted with this command:
> spark-submit --class spark_security.safe.SafeSockPuppet --driver-memory 3g --executor-memory 3g --num-executors 3 --executor-cores 4 --name safeSparkDealerUser --master yarn --deploy-mode cluster spark_Security-1.0-SNAPSHOT.jar.nocalse hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/spark_properties/safedealer.properties
> After about 1 hour, some executors exit. There are no more YARN logs after an executor exits, and there is no stack trace when it exits.
> The YARN node manager log shows the following:
> 2015-08-17 17:25:41,550 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Start request for container_1439803298368_0005_01_000001 by user root
> 2015-08-17 17:25:41,551 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Creating a new application reference for app application_1439803298368_0005
> 2015-08-17 17:25:41,551 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=root IP=172.19.160.102 OPERATION=Start Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1439803298368_0005 CONTAINERID=container_1439803298368_0005_01_000001
> 2015-08-17 17:25:41,551 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Application application_1439803298368_0005 transitioned from NEW to INITING
> 2015-08-17 17:25:41,552 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Adding container_1439803298368_0005_01_000001 to application application_1439803298368_0005
> 2015-08-17 17:25:41,557 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: rollingMonitorInterval is set as -1. The log rolling mornitoring interval is disabled. The logs will be aggregated after this application is finished.
> 2015-08-17 17:25:41,663 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Application application_1439803298368_0005 transitioned from INITING to RUNNING
> 2015-08-17 17:25:41,664 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1439803298368_0005_01_000001 transitioned from NEW to LOCALIZING
> 2015-08-17 17:25:41,664 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_INIT for appId application_1439803298368_0005
> 2015-08-17 17:25:41,664 INFO org.apache.spark.network.yarn.YarnShuffleService: Initializing container container_1439803298368_0005_01_000001
> 2015-08-17 17:25:41,665 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/user/root/.sparkStaging/application_1439803298368_0005/spark-assembly-1.3.1-hadoop2.6.0.jar transitioned from INIT to DOWNLOADING
> 2015-08-17 17:25:41,665 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/user/root/.sparkStaging/application_1439803298368_0005/spark_Security-1.0-SNAPSHOT.jar transitioned from INIT to DOWNLOADING
> 2015-08-17 17:25:41,665 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Created localizer for container_1439803298368_0005_01_000001
> 2015-08-17 17:25:41,668 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Writing credentials to the nmPrivate file /export/servers/hadoop2.6.0/tmp/nm-local-dir/nmPrivate/container_1439803298368_0005_01_000001.tokens. Credentials list:
> 2015-08-17 17:25:41,682 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Initializing user root
> 2015-08-17 17:25:41,686 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Copying from /export/servers/hadoop2.6.0/tmp/nm-local-dir/nmPrivate/container_1439803298368_0005_01_000001.tokens to /export/servers/hadoop2.6.0/tmp/nm-local-dir/usercache/root/appcache/application_1439803298368_0005/container_1439803298368_0005_01_000001.tokens
> 2015-08-17 17:25:41,686 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Localizer CWD set to /export/servers/hadoop2.6.0/tmp/nm-local-dir/usercache/root/appcache/application_1439803298368_0005 = file:/export/servers/hadoop2.6.0/tmp/nm-local-dir/usercache/root/appcache/application_1439803298368_0005
> 2015-08-17 17:25:42,240 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/user/root/.sparkStaging/application_1439803298368_0005/spark-assembly-1.3.1-hadoop2.6.0.jar(->/export/servers/hadoop2.6.0/tmp/nm-local-dir/usercache/root/filecache/14/spark-assembly-1.3.1-hadoop2.6.0.jar) transitioned from DOWNLOADING to LOCALIZED
> 2015-08-17 17:25:42,508 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/user/root/.sparkStaging/application_1439803298368_0005/spark_Security-1.0-SNAPSHOT.jar(->/export/servers/hadoop2.6.0/tmp/nm-local-dir/usercache/root/filecache/15/spark_Security-1.0-SNAPSHOT.jar) transitioned from DOWNLOADING to LOCALIZED
> 2015-08-17 17:25:42,508 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1439803298368_0005_01_000001 transitioned from LOCALIZING to LOCALIZED
> 2015-08-17 17:25:42,548 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1439803298368_0005_01_000001 transitioned from LOCALIZED to RUNNING
> ................................................
> 2015-08-17 17:26:20,366 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Start request for container_1439803298368_0005_01_000003 by user root
> 2015-08-17 17:26:20,367 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Adding container_1439803298368_0005_01_000003 to application application_1439803298368_0005
> 2015-08-17 17:26:20,368 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1439803298368_0005_01_000003 transitioned from NEW to LOCALIZING
> 2015-08-17 17:26:20,368 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_INIT for appId application_1439803298368_0005
> 2015-08-17 17:26:20,368 INFO org.apache.spark.network.yarn.YarnShuffleService: Initializing container container_1439803298368_0005_01_000003
> 2015-08-17 17:26:20,369 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1439803298368_0005_01_000003 transitioned from LOCALIZING to LOCALIZED
> 2015-08-17 17:26:20,370 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=root IP=172.19.160.102 OPERATION=Start Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1439803298368_0005 CONTAINERID=container_1439803298368_0005_01_000003
> 2015-08-17 17:26:20,443 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1439803298368_0005_01_000003 transitioned from LOCALIZED to RUNNING
> 2015-08-17 17:26:20,443 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Neither virutal-memory nor physical-memory monitoring is needed. Not running the monitor-thread
> 2015-08-17 17:26:20,449 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: launchContainer: [bash, /export/servers/hadoop2.6.0/tmp/nm-local-dir/usercache/root/appcache/application_1439803298368_0005/container_1439803298368_0005_01_000003/default_container_executor.sh]
> ..........................................
> 2015-08-18 01:50:30,297 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container container_1439803298368_0005_01_000003 succeeded
> 2015-08-18 01:50:30,440 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1439803298368_0005_01_000003 transitioned from RUNNING to EXITED_WITH_SUCCESS
> 2015-08-18 01:50:30,465 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1439803298368_0005_01_000003
> 2015-08-18 01:50:35,046 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=root OPERATION=Container Finished - Succeeded TARGET=ContainerImpl RESULT=SUCCESS APPID=application_1439803298368_0005 CONTAINERID=container_1439803298368_0005_01_000003
> 2015-08-18 01:50:35,062 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1439803298368_0005_01_000003 transitioned from EXITED_WITH_SUCCESS to DONE
> 2015-08-18 01:50:35,065 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Removing container_1439803298368_0005_01_000003 from application application_1439803298368_0005
> 2015-08-18 01:50:35,070 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Neither virutal-memory nor physical-memory monitoring is needed. Not running the monitor-thread
> 2015-08-18 01:50:35,082 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Considering container container_1439803298368_0005_01_000003 for log-aggregation
> 2015-08-18 01:50:35,089 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_STOP for appId application_1439803298368_0005
> 2015-08-18 01:50:35,099 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_1439803298368_0005_01_000003
> 2015-08-18 01:50:35,105 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /export/servers/hadoop2.6.0/tmp/nm-local-dir/usercache/root/appcache/application_1439803298368_0005/container_1439803298368_0005_01_000003
> 2015-08-18 01:50:47,601 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1439803298368_0005_01_000001 is : 15
> 2015-08-18 01:50:48,401 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1439803298368_0005_01_000001 and exit code: 15
> ExitCodeException exitCode=15:
>     at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
>     at org.apache.hadoop.util.Shell.run(Shell.java:455)
>     at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
>     at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
> container_1439803298368_0005_01_000003 was started at 2015-08-17 17:26:20 and ran normally, but at 2015-08-18 01:50:30 it transitioned to EXITED_WITH_SUCCESS, and in the end it received a CONTAINER_STOP event.
> container_1439803298368_0005_01_000001 was started at 2015-08-17 17:25:42 and exited suddenly at 2015-08-18 01:50:48 with exit code 15.
> So, according to the node manager log, container_1439803298368_0005_01_000003 transitioned from RUNNING to EXITED_WITH_SUCCESS.
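For reference, here is a minimal sketch of the non-commented settings from the spark-defaults.conf above expressed programmatically on a SparkConf. The values are copied verbatim from the file; note that in yarn-cluster mode the driver/AM JVM options (spark.driver.extraJavaOptions, spark.yarn.am.extraJavaOptions) still need to come from spark-defaults.conf or --conf on spark-submit, because the driver JVM is already running by the time application code sets them.

{code}
import org.apache.spark.SparkConf

// Sketch only: the uncommented settings from the spark-defaults.conf above,
// expressed on a SparkConf. Values are copied from the file verbatim.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.akka.frameSize", "1024")
  .set("spark.driver.extraJavaOptions", "-Dhdp.version=2.2.0.0-2041")
  .set("spark.yarn.am.extraJavaOptions", "-Dhdp.version=2.2.0.0-2041")
  .set("spark.akka.timeout", "900")
  .set("spark.storage.memoryFraction", "0.4")
  .set("spark.rdd.compress", "true")
  .set("spark.shuffle.blockTransferService", "nio")
  .set("spark.yarn.executor.memoryOverhead", "1024")
{code}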
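To make the workload in the description concrete, here is a hypothetical sketch of a Spark 1.3.1 Streaming job that reads from Kafka with a direct stream and derives four windowed DStreams. The broker list, topic, batch interval and window sizes are placeholders; this is not the actual spark_security.safe.SafeSockPuppet code.

{code}
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectStreamWindowSketch {
  def main(args: Array[String]): Unit = {
    // Batch interval is a placeholder; the real application's interval is not known.
    val conf = new SparkConf().setAppName("safeSparkDealerUser")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Placeholder Kafka connection details.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val topics = Set("events")

    // Direct (receiver-less) Kafka stream, as in the description.
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // Four windowed views over the same stream ("4 DStream windows" in the description);
    // the window lengths here are arbitrary examples.
    val windows = Seq(Seconds(60), Seconds(300), Seconds(600), Seconds(1800)).map { w =>
      stream.map(_._2).window(w, Seconds(10))
    }
    windows.foreach(_.count().print())

    ssc.start()
    ssc.awaitTermination()
  }
}
{code}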