Github user squito commented on a diff in the pull request:
https://github.com/apache/spark/pull/21346#discussion_r191981552
--- Diff: common/network-common/src/main/java/org/apache/spark/network/server/RpcHandler.java ---
@@ -38,15 +38,24 @@
    *
    * This method will not be called in parallel for a single TransportClient (i.e., channel).
    *
+   * The rpc *might* include a data stream in <code>streamData</code> (e.g. for uploading a large
+   * amount of data which should not be buffered in memory here). Any errors while handling the
+   * streamData will lead to failing this entire connection -- all other in-flight rpcs will fail.
--- End diff --
pretty good question actually :)
I will take a closer look at this myself, but I believe this connection is
shared by other tasks running on the same executor that are trying to talk to
the same destination. So that might mean another task that is replicating to
the same destination, or one reading data from that same remote executor. Those
paths don't have specific retry behavior for a closed connection -- the result
might be the data simply not getting replicated, the data being fetched from
elsewhere, or the task getting retried.
I think this is actually OK -- the existing code could cause an OOM on the
remote end anyway, which obviously would fail a lot more. This failure
behavior seems reasonable.
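For concreteness, below is a minimal sketch of what a handler consuming such a stream could look like, assuming a receive(...) overload that takes a streamData argument as the quoted javadoc describes. `StreamData`, its `writeToDisk` helper, and `StreamingUploadHandler` are hypothetical names used for illustration, not the actual API in this PR; only RpcHandler, TransportClient, RpcResponseCallback, and OneForOneStreamManager are real classes from network-common.
```java
import java.io.File;
import java.io.IOException;
import java.nio.ByteBuffer;

import org.apache.spark.network.client.RpcResponseCallback;
import org.apache.spark.network.client.TransportClient;
import org.apache.spark.network.server.OneForOneStreamManager;
import org.apache.spark.network.server.RpcHandler;
import org.apache.spark.network.server.StreamManager;

/**
 * Illustrative only: the StreamData type and the four-argument receive(...)
 * below are hypothetical stand-ins mirroring the quoted javadoc.
 */
public class StreamingUploadHandler extends RpcHandler {

  /** Hypothetical handle to the rpc's attached data stream. */
  public interface StreamData {
    /** Consumes the stream, e.g. by spilling it to disk instead of buffering in memory. */
    void writeToDisk(File dest) throws IOException;
  }

  /** Hypothetical overload matching the javadoc; streamData is null for plain rpcs. */
  public void receive(
      TransportClient client,
      ByteBuffer message,
      StreamData streamData,
      RpcResponseCallback callback) {
    try {
      if (streamData != null) {
        // Large upload: spill to disk rather than buffering the whole payload in memory.
        streamData.writeToDisk(File.createTempFile("upload", ".dat"));
      }
      callback.onSuccess(ByteBuffer.allocate(0));
    } catch (IOException e) {
      // Per the javadoc under discussion, an error while handling streamData does not
      // just fail this one rpc: the whole connection is failed, so every other
      // in-flight rpc sharing this channel fails as well.
      callback.onFailure(e);
      throw new RuntimeException(e);
    }
  }

  // The existing abstract methods of RpcHandler still need implementations.
  @Override
  public void receive(TransportClient client, ByteBuffer message, RpcResponseCallback callback) {
    receive(client, message, null, callback);
  }

  @Override
  public StreamManager getStreamManager() {
    return new OneForOneStreamManager();
  }
}
```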
---