[ https://issues.apache.org/jira/browse/SPARK-2604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14070227#comment-14070227 ]
Twinkle Sachdeva commented on SPARK-2604:
-----------------------------------------

I tried running in yarn-cluster mode, after setting the property spark.yarn.max.executor.failures to some number. The application does fail, but with a misleading exception (pasted at the end). Instead of handling the condition this way, we should probably check for the overhead memory amount during validation itself. Please share your thoughts if you think otherwise.

Stacktrace:

Application application_1405933848949_0024 failed 2 times due to Error launching appattempt_1405933848949_0024_000002. Got exception: java.net.ConnectException: Call From NN46/192.168.156.46 to localhost:51322 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
at java.lang.reflect.Constructor.newInstance(Unknown Source)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
at org.apache.hadoop.ipc.Client.call(Client.java:1414)
at org.apache.hadoop.ipc.Client.call(Client.java:1363)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy28.startContainers(Unknown Source)
at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96)
at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:118)
at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:249)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)

> Spark Application hangs on yarn in edge case scenario of executor memory
> requirement
> ------------------------------------------------------------------------------------
>
> Key: SPARK-2604
> URL: https://issues.apache.org/jira/browse/SPARK-2604
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.0.0
> Reporter: Twinkle Sachdeva
>
> In a yarn environment, let's say:
> MaxAM = maximum allocatable memory
> ExecMem = executor's memory
> If (MaxAM >= ExecMem && (MaxAM - ExecMem) < 384m),
> then the maximum-resource validation does not reject the request w.r.t. executor
> memory and the application master gets launched; but when the executor resource
> (memory plus the 384m overhead) is actually requested and validated again, the
> requests are returned and the application appears to hang.
> A typical use case is asking for executor memory equal to the maximum allowed
> memory as per the yarn config.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
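The up-front check proposed in the comment above could be sketched as follows. This is a minimal illustration, not Spark's actual implementation: the 384 MB figure matches the overhead constant mentioned in the issue, while the class and method names here are assumptions for the sketch.

```java
// Sketch of submission-time validation: reject the job up front when the
// requested executor memory plus the YARN overhead cannot fit inside the
// maximum allocatable container size, instead of letting the application
// hang waiting for containers YARN will never grant.
public class ExecutorMemoryCheck {
    // Executor memory overhead assumed by the issue (384 MB in Spark 1.x on YARN).
    static final int MEMORY_OVERHEAD_MB = 384;

    /** Returns true when execMemMb plus the overhead fits within maxAllocatableMb. */
    static boolean fitsInMaxAllocation(int execMemMb, int maxAllocatableMb) {
        return execMemMb + MEMORY_OVERHEAD_MB <= maxAllocatableMb;
    }

    public static void main(String[] args) {
        // Typical failing case from the issue: executor memory == maximum allowed memory.
        System.out.println(fitsInMaxAllocation(8192, 8192)); // false -> fail fast at validation
        System.out.println(fitsInMaxAllocation(7808, 8192)); // true  -> request is satisfiable
    }
}
```

With a check like this, the edge case described in the issue would surface as a clear validation error at submission time rather than a hang or a misleading ConnectException.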