Hi Colin,

Sorry for the confusion. I’m proposing to implement timeout in the 
NetworkClient.leastLoadedNode() when iterating all the cached node. The 
alternative I can think is to implement the timeout in NetworkClient.poll() 

I’d prefer to implement in the leastLoadedNode(). Here’re the reasons:
Usually when clients send a request, they will asking the network client to 
send the request to a specific node. In this case, the connection.setup.timeout 
won’t matter too much because the client doesn’t want to try other nodes for 
that specific request. The request level timeout would be enough. The metadata 
fetcher fetches the nodes status periodically so the clients can reassign the 
request to another node after timeout.
Consumer, producer, and AdminClient are all using leastLoadedNode() for 
metadata fetch, where the connection setup timeout can play an important role. 
Unlike other requests can refer to the metadata for node condition, the 
metadata requests can only blindly choose a node for retry in the worst 
scenario. We want to make sure the client can get the metadata smoothly and as 
soon as possible. As a result, we need this connection.setup.timeout.
Implementing the timeout in poll() or anywhere else might need an extra 
iteration of all nodes, which might downgrade the network client performance.
I also updated the KIP content and KIP status. Please let me know if the above 
ideas make sense. Thanks.

Best, - Cheng Tan



> On May 4, 2020, at 5:26 PM, Colin McCabe <cmcc...@apache.org> wrote:
> 
> Hi Cheng,
> 
> On the KIP page, it lists this KIP as "draft."  It seems like "under 
> discussion" is appropriate here, right?
> 
>> Currently, the initial socket connection timeout is depending on Linux 
>> kernel setting
>> tcp_syn_retries. The timeout value is 2 ^ (tcp_sync_retries + 1) - 1 
>> seconds. For the
>> reasons below, we want to control the client-side socket timeout directly 
>> using 
>> configuration files
> 
> Linux is just one example of an OS that Kafka could run on, right?  You could 
> also be running on MacOS, for example.
> 
>> I'm proposing to do a lazy socket connection time out. That is, we only 
>> check if
>> we need to timeout a socket when we consider the corresponding node as a 
>> candidate in the node provider.
> 
> The NodeProvider is an AdminClient abstraction, right?  Why wouldn't we 
> implement a connection setup timeout for all clients, not just AdminClient?
> 
> best,
> Colin
> 
> On Mon, May 4, 2020, at 13:18, Colin McCabe wrote:
>> Hmm.  A big part of the reason behind the KIP is that the default 
>> connection timeout behavior of the OS doesn't work for Kafka, right?  
>> For example, on Linux, if we wait 127 seconds for a connection attempt 
>> to time out, we won't get a chance to make another attempt in most 
>> cases.  So I think it makes sense to set a shorter default.
>> 
>> best,
>> Colin
>> 
>> 
>> On Mon, May 4, 2020, at 09:44, Jose Garcia Sancio wrote:
>>> Thanks for the KIP Cheng,
>>> 
>>>> The default value will be 10 seconds.
>>> 
>>> I think we should make the default the current behavior. Meaning the
>>> default should leverage the default connect timeout from the operating
>>> system.
>>> 
>>>> Proposed Changes
>>> 
>>> I don't fully understand this section. It seems like it is mainly
>>> focused on the problem with the current implementation. Can you
>>> explain how the proposed changes solve the problem?
>>> 
>>> Thanks.
>>> 
>>> 
>>> -- 
>>> -Jose
>>> 
>> 

Reply via email to