GitHub user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21346#discussion_r192565980
  
    --- Diff: common/network-common/src/main/java/org/apache/spark/network/server/RpcHandler.java ---
    @@ -38,15 +38,24 @@
        *
        * This method will not be called in parallel for a single TransportClient (i.e., channel).
        *
    +   * The rpc *might* included a data stream in <code>streamData</code> (eg. for uploading a large
    +   * amount of data which should not be buffered in memory here).  Any errors while handling the
    +   * streamData will lead to failing this entire connection -- all other in-flight rpcs will fail.
    --- End diff --
    
    I'm trying to think through whether this risks introducing new failure modes (or making existing-but-improbable failure modes more frequent). For example, failing in-flight RPCs could surface latent RPC timeout issues: if a timeout is far too long and we drop in-flight responses on the floor without sending back negative ACKs, then callers could see finite but potentially long hangs.
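
    To make the negative-ACK concern concrete, here is a minimal sketch of one way to avoid the hang. It assumes a hypothetical `OutstandingRpcs` registry and `Callback` interface; neither name comes from this PR, and this is not meant to describe how the transport layer actually tracks requests. The idea is that when the connection fails, every still-pending callback is failed immediately, so no caller has to wait out a long timeout.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: these names are illustrative and are not Spark's API.
class OutstandingRpcs {
  /** Minimal callback interface, assumed for this example. */
  interface Callback {
    void onSuccess(String response);
    void onFailure(Throwable cause);
  }

  // Request id -> callback for every RPC that has not been answered yet.
  private final Map<Long, Callback> outstanding = new ConcurrentHashMap<>();

  void register(long requestId, Callback callback) {
    outstanding.put(requestId, callback);
  }

  /** Normal path: a response arrived for one request. */
  void complete(long requestId, String response) {
    Callback callback = outstanding.remove(requestId);
    if (callback != null) {
      callback.onSuccess(response);
    }
  }

  /**
   * Failure path: the connection is being torn down, so fail every pending
   * request right away instead of letting callers wait for a timeout.
   * Returns the number of callbacks that were failed.
   */
  int failAll(Throwable cause) {
    int failed = 0;
    for (Long requestId : new ArrayList<>(outstanding.keySet())) {
      // remove() guarantees each callback is invoked at most once.
      Callback callback = outstanding.remove(requestId);
      if (callback != null) {
        callback.onFailure(cause);
        failed++;
      }
    }
    return failed;
  }

  int pending() {
    return outstanding.size();
  }

  public static void main(String[] args) {
    OutstandingRpcs rpcs = new OutstandingRpcs();
    List<String> events = new ArrayList<>();
    Callback recorder = new Callback() {
      @Override public void onSuccess(String response) { events.add("ok:" + response); }
      @Override public void onFailure(Throwable cause) { events.add("fail:" + cause.getMessage()); }
    };
    rpcs.register(1L, recorder);
    rpcs.register(2L, recorder);
    rpcs.complete(1L, "pong");                          // request 1 succeeds normally
    rpcs.failAll(new IOException("stream error"));      // request 2 is failed eagerly
    System.out.println(events);
  }
}
```

    With something along these lines, an error while handling streamData would translate into prompt failures for the other in-flight RPCs on that channel rather than silent drops, which keeps an overly long timeout from turning into a hang.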
    
    On the other hand, this pathway is used for executor <-> executor transfers and generally not executor <-> driver transfers, so my understanding is that failures in this channel generally won't impact control RPCs.


---
