+1 (binding) Thank you for the contribution.
On Wed, Jun 3, 2020 at 12:53 AM Cheng Tan <c...@confluent.io> wrote: > Dear Rajini, > > Thanks for the feedback. > > 1) > Because "request.timeout.ms" only affects in-flight requests, after the > API NetworkClient.ready() is invoked, the connection won't get closed after > "request.timeout.ms” hits. Before > a) the SocketChannel is connected > b) ssl handshake finished > c) authentication has finished (sasl) > clients cannot invoke NetworkClient.send() to send any request, which > means no in-flight request targeting to the connection will be added. > > > 2) > I think a default value of 127 seconds make sense, which meets the timeout > indirectly specified by the default value of “tcp.syn.retries”. I’ve added > this into the KIP proposal. > > > 3) > Every time the timeout hits, the timeout value of the next connection try > will increase. > > The timeout will hit iff a connection stays at the `connecting` state > longer than the timeout value, as indicated by > ClusterConnectionStates.NodeConnectionState. The connection state of a node > will change iff `SelectionKey.OP_CONNECT` is detected by > `nioSelector.Select()`. The connection state may transit from `connecting` > to > > a) `disconnected` when SocketChannel.finishConnect() throws > IOException. > b) `connected` when SocketChannel.finishConnect() return TRUE. > > In other words, the timeout will hit and increase iff the interested > SelectionKey.OP_CONNECT doesn't trigger before the timeout arrives, which > means, for example, network congestion, failure of the ARP request, packet > filtering, routing error, or a silent discard may happen. (I didn’t read > the Java NIO source code. Please correct me the case when OP_CONNECT won’t > get triggered if I’m wrong) > > > 4) > > A) Connection timeout dominates both request timeout and API timeout > > When connection timeout hits, the connection will be closed. The client > will be notified either by the responses constructed by NetworkClient or > the callbacks attached to the request. As a result, the request failure > will be handled before either connection timeout or API timeout arrives. > > > B) Neither request timeout or API timeout dominates connection timeout > > i) Request timeout: Because request timeout only affects in-flight > requests, after the API NetworkClient.ready() is invoked, the connection > won't get closed after "request.timeout.ms” hits. Before > 1. the SocketChannel is connected > 2. SSL handshake finished > 3. authentication has finished (SASL) > , clients won't be able to invoke NetworkClient.send() to send any > request, which means no in-flight request targeting to the connection will > be added. > > ii) API timeout: In AdminClient, API timeout acts by putting a smaller and > smaller timeout value to the chain of requests in a same API. After the API > timeout hits, the retry logic won't close any connection. In consumer, API > timeout acts as a whole by putting a limit to the code block executing > time. The retry logic won't close any connection as well. > > > Conclusion: > > Thanks again for the long feedback and I’m always enjoying them. I’ve > supplement the above discussion into the KIP proposal. Please let me know > what you think. > > > Best, - Cheng Tan > > > > On Jun 2, 2020, at 3:01 AM, Rajini Sivaram <rajinisiva...@gmail.com> > wrote: > > > > Hi Cheng, > > > > Not sure if the discussion should move back to the DISCUSS thread. I > have a > > few questions: > > > > 1) The KIP motivation says that in some cases `request.timeout.ms` > doesn't > > timeout connections properly and as a result it takes 127s to detect a > > connection failure. This sounds like a bug rather than a limitation of > the > > current approach. Can you explain the scenarios where this occurs? > > > > 2) I think the current proposal is to use non-exponential 10s connection > > timeout as default with the option to use exponential timeout. So > > connection timeouts for every connection attempt will be between 8s and > 12s > > by default. Is that correct? Should we use a default max timeout to > enable > > exponential timeout by default since 8s seems rather small? > > > > 3) What is the scope of `failures` used to determine connection timeout > > with exponential timeouts? Will we always use 10s followed by 20s every > > time a connection is attempted? > > > > 4) It will be good if we can include two flows with the relationship > > between various timeouts in the KIP. One with a fixed node like a typical > > produce/consume request to the leader and another that uses > > `leastLoadedNode` like a metadata request. Having the comparison between > > the current and proposed behaviour w.r.t all configurable timeouts (the > two > > new connection timeouts, request timeout, api timeout etc.) will be > useful. > > > > Regards, > > > > Rajini > > > -- Gwen Shapira Engineering Manager | Confluent 650.450.2760 | @gwenshap Follow us: Twitter | blog