On Tue, May 19, 2020, at 03:27, Rajini Sivaram wrote:
> Hi Colin,
> 
> I do agree about the `leastLoadedNode` case. My question was about the
> other cases where we are connecting to a specific node: fetch requests to
> leaders, produce requests to leaders, requests to group coordinators,
> requests to controller etc. It will be good to either quantify that these
> connections are less common and hence less critical in terms of performance
> in typical deployments or describe the impact on these connections from the
> proposed change in default behaviour. It is perfectly fine if connections
> to specific nodes don't benefit from the new timeout, I was looking for
> analysis which says they aren't made any worse either, especially in the
> context of other connection rate limiting/quota work we are proposing like
> KIP-612.
> 

Hi Rajini,

This is a fair point.  In the VOTE thread, I proposed using an exponential 
connection retry backoff to mitigate this problem.  So the first few retries 
would happen quickly, but later retries would take increasingly longer, keeping 
the number of reconnect attempts down.

(This is assuming we're trying to connect to a single fixed node, like the 
controller node)
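
To make that concrete, here is a rough sketch of what an exponential 
connection backoff could look like.  This is just an illustration, not the 
actual NetworkClient code: the class name, method names, and the base/cap 
values below are all made up for the example.

    import java.util.concurrent.ThreadLocalRandom;

    // Illustrative only: tracks consecutive connection failures to one node
    // and returns an exponentially growing, jittered wait before the next
    // attempt, capped at a maximum.
    public class ConnectionBackoff {
        private final long baseMs;  // e.g. 50 ms before the first retry
        private final long maxMs;   // e.g. 10,000 ms upper bound on the wait
        private int failures = 0;

        public ConnectionBackoff(long baseMs, long maxMs) {
            this.baseMs = baseMs;
            this.maxMs = maxMs;
        }

        // Call after a failed connection attempt to get the next wait time.
        public long nextBackoffMs() {
            long exp = (long) Math.min(baseMs * Math.pow(2.0, failures), maxMs);
            failures++;
            // +/- 20% jitter so many clients don't reconnect in lockstep.
            double jitter = 0.8 + 0.4 * ThreadLocalRandom.current().nextDouble();
            return (long) (exp * jitter);
        }

        // Call when a connection attempt finally succeeds.
        public void reset() {
            failures = 0;
        }
    }

With a 50 ms base and a 10 second cap, the first few retries happen almost 
immediately (roughly 50, 100, 200 ms, ...), while later retries back off 
toward the cap, which keeps the reconnect rate against a single fixed node 
low.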

best,
Colin


>
> Regards,
> 
> Rajini
> 
> 
> On Mon, May 18, 2020 at 8:48 PM Colin McCabe <cmcc...@apache.org> wrote:
> 
> > Hi Rajini,
> >
> > I think the idea behind the 10 second default is that if you have three
> > Kafka nodes A, B, C (or whatever), and you can't talk to A within 10
> > seconds, you'll try again with B or C, and still have plenty of time left
> > over.  Whereas currently, if your connection hangs while trying to connect
> > to A, you're out of luck-- you'll just hang until the whole request timeout
> > is gone.  So while you could have tried a different node and succeeded, you
> > never got a chance to.
> >
> > So in the common case where you have other nodes that you can connect to,
> > we won't end up trying to reconnect to the same node over and over.  I'll
> > add some more comments in the vote thread.
> >
> > best,
> > Colin
> >
> >
> > On Fri, May 15, 2020, at 14:13, Rajini Sivaram wrote:
> > > Hi Cheng,
> > >
> > > I am fine with the rest of the KIP apart from the 10s default. If no one
> > > else has any concerns about this new default, let's go with it. Please go
> > > ahead and start vote.
> > >
> > > Regards,
> > >
> > > Rajini
> > >
> > >
> > > On Fri, May 15, 2020 at 8:21 PM Cheng Tan <c...@confluent.io> wrote:
> > >
> > > > Dear Rajini,
> > > >
> > > >
> > > > Thanks for the reply.
> > > >
> > > > > We have a lot of these and I want to understand the benefits of the
> > > > > proposed timeout in this case alone. We currently have a request
> > > > > timeout of 30s. Would you consider adding a 10s connection timeout?
> > > >
> > > > A shorter timeout (10s) at the transport level will help clients detect
> > > > dead nodes faster. “request.timeout.ms” is too general and applies to
> > > > all requests, whose complexity at the application level varies. It’s
> > > > risky to lower “request.timeout.ms” just to detect dead nodes more
> > > > quickly, because the application layer is involved.
> > > >
> > > > Once “socket.connection.setup.timeout.ms” is exceeded, NetworkClient
> > > > will fail the request in the same way it handles “request.timeout.ms”.
> > > > That is to say, the response will be constructed with a
> > > > RetriableException, and Producer, Consumer, and KafkaAdminClient can
> > > > then perform their retry logic just as they do when a request timeout
> > > > happens.
> > > >
> > > > > We have KIP-612 that is proposing to throttle connection set up on the
> > > > > one hand and this KIP that is dramatically reducing default connection
> > > > > timeout on the other. Not sure if that is a good idea.
> > > >
> > > > The default of the broker connection creation rate limit is
> > > > Int.MaxValue. The KIP also proposes per-IP throttle configuration. Thus,
> > > > I don’t expect the combination of the broker connection throttle and a
> > > > shorter client transport-level connection timeout to have adverse side
> > > > effects.
> > > >
> > > > Do the reasons above make sense to you?
> > > >
> > > > Best, - Cheng
> > > >
> > > >
> > > >
> > > >
> > > > > On May 15, 2020, at 4:49 AM, Rajini Sivaram <rajinisiva...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > Hi Cheng,
> > > > >
> > > > > Let me rephrase my question. Let's say we didn't have the case of
> > > > > leastLoadedNode. We are only talking about connections to a specific
> > > > > node (i.e. leader or controller). We have a lot of these and I want to
> > > > > understand the benefits of the proposed timeout in this case alone. We
> > > > > currently have a request timeout of 30s. Would you consider adding a
> > > > > 10s connection timeout? And if you did, what would you expect the 10s
> > > > > timeout to do?
> > > > >
> > > > > a) We could fail a request if the connection didn't complete within
> > > > > 10s. If we always expect connections to succeed within 10s, this would
> > > > > be considered reasonable behaviour. But this would be changing the
> > > > > current default, which allows you up to 30 seconds to connect and
> > > > > process a request.
> > > > > b) We retry the connection. What would be the point? We were waiting
> > > > > in a queue for connecting, but we decide to stop and join the back of
> > > > > the queue.
> > > > >
> > > > > We have KIP-612 that is proposing to throttle connection set up on the
> > > > > one hand and this KIP that is dramatically reducing default connection
> > > > > timeout on the other. Not sure if that is a good idea.
> > > > >
> > > > >
> > > > > On Fri, May 15, 2020 at 1:26 AM Cheng Tan <c...@confluent.io> wrote:
> > > >
> > > >
> > >
> >
>
