[ 
https://issues.apache.org/jira/browse/SPARK-11098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14958103#comment-14958103
 ] 

Marcelo Vanzin commented on SPARK-11098:
----------------------------------------

So, while working on another patch in this area, I ran into this issue, and I 
don't think it's a problem in the RPC layer, but rather a problem of the code 
calling the RPC layer.

Even if somehow you synchronize things in the RPC env implementation so that 
RPCs are sent in the order they arrive, there are multiple threads that can be 
calling {{RpcEndpointRef.send()}} or {{RpcEndpointRef.ask()}} at the same time, and 
at that point there's no guarantee of any order.
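
To illustrate the point, here is a minimal sketch in Python rather than Spark's 
own code. It assumes a hypothetical FIFO "transport" (a plain queue standing in 
for an RPC env that delivers strictly in arrival order) and a few sender 
threads. The names {{run_senders}}, {{transport}} and {{sender}} are made up 
for this example.

```python
import queue
import threading


def run_senders(num_senders=3, per_sender=5):
    """Have several threads call put() (our stand-in for send()) concurrently."""
    # Hypothetical transport: strictly FIFO, i.e. it delivers messages in
    # exactly the order in which the put() calls reached it.
    transport = queue.Queue()

    def sender(sender_id):
        for seq in range(per_sender):
            transport.put((sender_id, seq))

    threads = [
        threading.Thread(target=sender, args=(i,)) for i in range(num_senders)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Drain the transport in delivery order.
    delivered = []
    while not transport.empty():
        delivered.append(transport.get())
    return delivered


delivered = run_senders()

# Messages from one thread keep their relative order...
per_sender_order = {
    sender_id: [seq for (sid, seq) in delivered if sid == sender_id]
    for sender_id in range(3)
}
# ...but how the threads interleave with each other is up to the scheduler,
# so the overall sequence in `delivered` can differ from run to run.
```

Even with a perfectly ordered transport, only the per-thread order is fixed; 
the order across threads is decided before the messages ever reach the RPC 
layer.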

The specific problem I ran into was the Worker ignoring messages from the 
Master because it thought the Master was not active. Those messages were 
arriving before the Master had replied to the Worker's registration message. 
That's not the fault of the RPC layer; it's the fault of that reply being sent 
to the Worker as a separate message instead of as an RPC reply to the 
{{RegisterWorker}} message. {{Worker}} in this case should be using {{ask}} 
and getting the reply from that ask; that ensures the reply will arrive before 
any other messages the Master may want to send to the Worker.
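
The difference can be sketched with a small single-threaded toy model in 
Python. This is not Spark's implementation: the {{Worker}} and {{Master}} 
classes below are hypothetical stand-ins, the message names are just 
illustrative strings, and the "ask" is modelled as a plain method call whose 
return value is the reply. The model also assumes the worker handles the ask 
reply before it processes anything else.

```python
class Worker:
    """Toy worker that ignores messages until it believes the master is active."""

    def __init__(self):
        self.master_active = False
        self.handled = []
        self.dropped = []

    def receive(self, msg):
        if msg == "RegisteredWorker":
            self.master_active = True
        elif self.master_active:
            self.handled.append(msg)
        else:
            # Master not considered active yet, so the message is ignored.
            self.dropped.append(msg)


class Master:
    """Toy master whose registration handler returns the reply to the caller."""

    def handle_register(self):
        return "RegisteredWorker"


def register_with_send(worker, delivery_order):
    # The registration reply is an ordinary one-way message, so it can be
    # delivered after other messages from the master.
    for msg in delivery_order:
        worker.receive(msg)


def register_with_ask(worker, master, later_messages):
    # The reply comes back as the result of the ask itself, so the worker
    # sees it before any later message.
    reply = master.handle_register()
    worker.receive(reply)
    for msg in later_messages:
        worker.receive(msg)


# Case 1: the separate reply is overtaken by a later message.
w1 = Worker()
register_with_send(w1, ["LaunchExecutor", "RegisteredWorker"])

# Case 2: the reply is tied to the registration request.
w2 = Worker()
register_with_ask(w2, Master(), ["LaunchExecutor"])
```

In the first case the worker drops {{LaunchExecutor}} because it arrives 
before the registration reply; in the second case the same message is handled.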

If you want to see how to do that properly, see how 
{{CoarseGrainedExecutorBackend}} does its registration with the scheduler using 
{{ask}} instead of {{send}}.

Anyway, I have that fixed in my patch; I might take it out as a separate fix 
and attach it to this bug. But I'm not sure whether other areas of the code 
suffer from the same problem.

> RPC message ordering is not guaranteed
> --------------------------------------
>
>                 Key: SPARK-11098
>                 URL: https://issues.apache.org/jira/browse/SPARK-11098
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core
>            Reporter: Reynold Xin
>
> NettyRpcEnv doesn't guarantee message delivery order since there are multiple 
> threads sending messages in the clientConnectionExecutor thread pool. We 
> should fix that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
