otterc commented on a change in pull request #30312:
URL: https://github.com/apache/spark/pull/30312#discussion_r523304450
##########
File path:
common/network-common/src/main/java/org/apache/spark/network/client/TransportClientFactory.java
##########
@@ -254,7 +254,7 @@ TransportClient createClient(InetSocketAddress address)
// Disable Nagle's Algorithm since we don't want packets to wait
.option(ChannelOption.TCP_NODELAY, true)
.option(ChannelOption.SO_KEEPALIVE, true)
- .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, conf.connectionTimeoutMs())
+ .option(ChannelOption.CONNECT_TIMEOUT_MILLIS,
conf.connectionCreationTimeoutMs())
Review comment:
The `connectionTimeoutMs()` has been overloaded to serve as two
different types of timeout:
1. It's used as a connection idle timeout. My understanding is that the 120s
default was chosen to satisfy this timeout because in some cases the shuffle
server can take much longer to respond to a request than usual.
2. It is also used as connection creation timeout. The timeout to create the
connection should not be this high. This especially becomes a problem because
the thread which is creating a connection is also blocked for this specified
time when it tries to connect to a bad node
[here](https://github.com/apache/spark/blob/0046222a758fda2aead4a77bd19b19b913276693/common/network-common/src/main/java/org/apache/spark/network/client/TransportClientFactory.java#L283).
With push-based shuffle, an executor can be running a task that fetches
shuffle data from the same shuffle service to which another task is trying to
push the data. Since both push and fetch share the connections to the shuffle
server, whichever succeeds to get the connection lock
[here](https://github.com/apache/spark/blob/0046222a758fda2aead4a77bd19b19b913276693/common/network-common/src/main/java/org/apache/spark/network/client/TransportClientFactory.java#L198)
will block the other one. Blocking for a short time is okay bu
t the larger timeout causes the other to be blocked for a longer time. It
makes these tasks fail very slow. Also fetches are retried even when connection
fails to establish, so the higher timeout for creation increases the task
duration. One of the intentions here is that we didn't want the push side to
interfere in any way with the fetch side. When a task that pushes acquires the
connection lock before a task that wants to fetch from the same shuffle server
node, we want the push to relinquish the lock sooner. This is why the
connection creation timeout was introduced and we change the default in the
code to be smaller.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]