[ https://issues.apache.org/jira/browse/SPARK-10987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14947893#comment-14947893 ]

Marcelo Vanzin commented on SPARK-10987:
----------------------------------------

Anyway, here's what I found so far.

The driver launches the AM; the AM connects back to the driver and sends
messages over that connection. But the driver never sends any messages to the
AM. That means that, on the AM side, {{NettyRpcHandler::connectionTerminated}}
never sends the {{Disassociated}} message when that connection goes away: since
no message was ever sent *to* the AM, the code in {{NettyRpcHandler::receive}}
never ran, so the driver connection was never recorded.
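To make the failure mode concrete, here is a toy model of the behavior described above. It is not Spark's code: {{SimplifiedRpcHandler}}, the integer connection ids and the address strings are hypothetical, and the only thing the model tries to capture is that the remote address is recorded in {{receive}} and looked up in {{connectionTerminated}}.

```scala
import scala.collection.mutable

// Hypothetical event type standing in for the Disassociated message.
final case class Disassociated(remoteAddress: String)

// Toy model (not Spark's NettyRpcHandler): the remote address of a connection
// is only learned when a message arrives on it.
final class SimplifiedRpcHandler {
  // Connection id -> remote address, filled in only by receive().
  private val remoteAddresses = mutable.Map.empty[Int, String]
  // Events "delivered" to endpoints, kept here so the behavior can be inspected.
  val delivered = mutable.ArrayBuffer.empty[Disassociated]

  // Runs only for messages sent *to* this process; the sole place an address is recorded.
  def receive(clientId: Int, senderAddress: String): Unit =
    if (!remoteAddresses.contains(clientId)) remoteAddresses(clientId) = senderAddress

  // Runs when any connection closes; with no recorded address, nothing is delivered.
  def connectionTerminated(clientId: Int): Unit =
    remoteAddresses.remove(clientId).foreach(addr => delivered += Disassociated(addr))
}
```

In this model, a connection the AM only ever wrote to (nothing received on it) produces no {{Disassociated}} when it closes, while a connection on which {{receive}} ran does.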

So there must be a way for {{NettyRpcHandler}} to know when outgoing 
connections are killed, not just incoming ones.
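One possible shape for that, sketched under the assumption that the transport can notify the handler when it opens an outgoing connection ({{outgoingConnectionOpened}} and {{ConnectionTrackingRpcHandler}} are hypothetical names, not existing Spark or Netty APIs): record the remote address at connect time, so {{connectionTerminated}} can fire {{Disassociated}} whether or not anything was ever received on that connection. This only illustrates the idea; it is not necessarily how it should be fixed in Spark.

```scala
import scala.collection.mutable

// Hypothetical event type standing in for the Disassociated message.
final case class Disassociated(remoteAddress: String)

// Toy model (not Spark code) that tracks connections regardless of direction.
final class ConnectionTrackingRpcHandler {
  // Connection id -> remote address, whichever side opened the connection.
  private val remoteAddresses = mutable.Map.empty[Int, String]
  val delivered = mutable.ArrayBuffer.empty[Disassociated]

  // Assumed hook: called when this process opens a connection to a remote
  // endpoint. The address is already known here, so record it immediately.
  def outgoingConnectionOpened(clientId: Int, remoteAddress: String): Unit =
    remoteAddresses(clientId) = remoteAddress

  // Incoming connections are still recorded on their first message.
  def receive(clientId: Int, senderAddress: String): Unit =
    if (!remoteAddresses.contains(clientId)) remoteAddresses(clientId) = senderAddress

  // Now also fires for outgoing connections nothing was ever received on;
  // removing the entry ensures at most one Disassociated per connection.
  def connectionTerminated(clientId: Int): Unit =
    remoteAddresses.remove(clientId).foreach(addr => delivered += Disassociated(addr))
}
```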

In a way, this is caused by the code trying, and failing, to mimic what Akka
does. Since the AM is purely a client, it shouldn't need to listen for
connections or rely on incoming connections for anything; it should be able to
register itself and do everything using the client socket it opened. That's
probably going to be tricky to fix, though.

> yarn-client mode misbehaving with netty-based RPC backend
> ---------------------------------------------------------
>
>                 Key: SPARK-10987
>                 URL: https://issues.apache.org/jira/browse/SPARK-10987
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, YARN
>    Affects Versions: 1.6.0
>            Reporter: Marcelo Vanzin
>            Priority: Blocker
>
> YARN running in cluster deploy mode seems to be having issues with the new 
> RPC backend; if you look at unit test runs, tests that run in cluster mode 
> are taking several minutes to run, instead of the more usual 20-30 seconds.
> For example, 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43349/consoleFull:
> {noformat}
> [info] YarnClusterSuite:
> [info] - run Spark in yarn-client mode (13 seconds, 953 milliseconds)
> [info] - run Spark in yarn-cluster mode (6 minutes, 50 seconds)
> [info] - run Spark in yarn-cluster mode unsuccessfully (1 minute, 53 seconds)
> [info] - run Python application in yarn-client mode (21 seconds, 842 milliseconds)
> [info] - run Python application in yarn-cluster mode (7 minutes, 0 seconds)
> [info] - user class path first in client mode (1 minute, 58 seconds)
> [info] - user class path first in cluster mode (4 minutes, 49 seconds)
> {noformat}


