[ https://issues.apache.org/jira/browse/SPARK-10987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14947893#comment-14947893 ]
Marcelo Vanzin commented on SPARK-10987:
----------------------------------------

Anyway, here's what I found so far.

The driver launches the AM; the AM connects back to the driver and sends stuff. But the driver never sends any messages to the AM. That means that in {{NettyRpcHandler::connectionTerminated}}, the {{Disassociated}} message is not sent: since no message was ever sent *to* the AM, the code in {{NettyRpcHandler::receive}} was never run, so the driver connection was never recorded.

So there must be a way for {{NettyRpcHandler}} to know when outgoing connections are killed, not just incoming ones.

In a way this is caused by the code trying to mimic what Akka does, but failing at it; since the AM is purely a client, it shouldn't need to listen for connections and rely on incoming connections for anything - it should be able to register itself and do everything using the client socket it opened. That's probably going to be tricky to fix, though.

> yarn-client mode misbehaving with netty-based RPC backend
> ---------------------------------------------------------
>
>                 Key: SPARK-10987
>                 URL: https://issues.apache.org/jira/browse/SPARK-10987
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, YARN
>    Affects Versions: 1.6.0
>            Reporter: Marcelo Vanzin
>            Priority: Blocker
>
> YARN running in cluster deploy mode seems to be having issues with the new
> RPC backend; if you look at unit test runs, tests that run in cluster mode
> are taking several minutes to run, instead of the more usual 20-30 seconds.
> For example,
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43349/consoleFull:
> {noformat}
> [info] YarnClusterSuite:
> [info] - run Spark in yarn-client mode (13 seconds, 953 milliseconds)
> [info] - run Spark in yarn-cluster mode (6 minutes, 50 seconds)
> [info] - run Spark in yarn-cluster mode unsuccessfully (1 minute, 53 seconds)
> [info] - run Python application in yarn-client mode (21 seconds, 842
> milliseconds)
> [info] - run Python application in yarn-cluster mode (7 minutes, 0 seconds)
> [info] - user class path first in client mode (1 minute, 58 seconds)
> [info] - user class path first in cluster mode (4 minutes, 49 seconds)
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
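
The failure mode described in the comment above can be illustrated with a small toy model. This is only a sketch, not Spark's code: the class {{ToyRpcHandler}}, its methods, the {{track_outgoing}} flag and the {{simulate}} helper are all hypothetical names made up for illustration. It assumes, as the comment describes, that a peer is recorded only when a message is received from it, and that the disconnect notification is delivered only for recorded peers.

```python
class ToyRpcHandler:
    """Toy stand-in for the handler on the AM side (hypothetical, not Spark's API)."""

    def __init__(self, track_outgoing=False):
        # track_outgoing models the suggested direction of the fix:
        # also remember connections this side opened as a client.
        self.track_outgoing = track_outgoing
        self.known_peers = set()
        self.delivered = []  # (peer, message) pairs handed to the endpoint

    def open_client_connection(self, peer):
        # Outgoing connection; in the buggy variant nothing is recorded here.
        if self.track_outgoing:
            self.known_peers.add(peer)

    def receive(self, peer, message):
        # The only place the buggy variant learns about a peer.
        self.known_peers.add(peer)
        self.delivered.append((peer, message))

    def connection_terminated(self, peer):
        # A disconnect is reported only for peers that were recorded.
        if peer in self.known_peers:
            self.known_peers.remove(peer)
            self.delivered.append((peer, "Disassociated"))


def simulate(track_outgoing):
    am = ToyRpcHandler(track_outgoing)
    am.open_client_connection("driver")  # AM connects back to the driver
    # The AM sends requests, but the driver never sends anything to the AM,
    # so am.receive(...) is never called.
    am.connection_terminated("driver")   # the driver connection goes away
    return am.delivered


print(simulate(track_outgoing=False))  # nothing delivered: the disconnect is missed
print(simulate(track_outgoing=True))   # the disconnect is reported
```

With {{track_outgoing=False}} the AM never learns that the driver connection went away, which matches the behaviour the comment describes. With {{track_outgoing=True}} the disconnect is reported even though no message was ever received, which is the property the comment says the handler needs.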