Thinking more about the potential wider use cases, I modified the proposal to 
target all connections. Thanks.

Best, - Cheng Tan

> On May 7, 2020, at 1:41 AM, Cheng Tan <c...@confluent.io> wrote:
> 
> Hi Colin,
> 
> Sorry for the confusion. I'm proposing to implement the timeout in 
> NetworkClient.leastLoadedNode() when iterating all the cached nodes. The 
> alternative I can think of is to implement the timeout in NetworkClient.poll().
> 
> I'd prefer to implement it in leastLoadedNode(), for these reasons:
> 1. Usually, when clients send a request, they ask the network client to send 
> it to a specific node. In that case, the connection.setup.timeout won't 
> matter much, because the client doesn't want to try other nodes for that 
> specific request; the request-level timeout is enough. The metadata fetcher 
> fetches node status periodically, so the client can reassign the request to 
> another node after a timeout.
> 2. Consumer, producer, and AdminClient all use leastLoadedNode() for the 
> metadata fetch, where the connection setup timeout can play an important 
> role. Unlike other requests, which can refer to the metadata for node 
> condition, a metadata request can only blindly choose a node to retry in the 
> worst scenario. We want to make sure the client can get the metadata smoothly 
> and as soon as possible. As a result, we need this connection.setup.timeout.
> 3. Implementing the timeout in poll() or anywhere else might need an extra 
> iteration over all nodes, which could degrade the network client's performance.
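> 
> To make the idea concrete, here is a rough standalone sketch of the lazy 
> check; all class, field, and method names below are illustrative only, not 
> the actual NetworkClient code:
> 
>     import java.util.List;
>     import java.util.Map;
> 
>     // Illustrative sketch only: the connection setup timeout is checked
>     // lazily, i.e. only when a node is considered as a candidate, so no
>     // extra pass over all nodes is needed.
>     class ConnectionSetupTimeoutSketch {
>         private final long connectionSetupTimeoutMs;     // e.g. 10_000
>         private final Map<String, Long> connectStartMs;  // nodeId -> time connect() was initiated
> 
>         ConnectionSetupTimeoutSketch(long timeoutMs, Map<String, Long> connectStartMs) {
>             this.connectionSetupTimeoutMs = timeoutMs;
>             this.connectStartMs = connectStartMs;
>         }
> 
>         // True if the in-flight connection attempt to this node has exceeded the timeout.
>         boolean setupTimedOut(String nodeId, long nowMs) {
>             Long startedMs = connectStartMs.get(nodeId);
>             return startedMs != null && nowMs - startedMs > connectionSetupTimeoutMs;
>         }
> 
>         // Pick the first candidate whose connection attempt has not timed out.
>         // A real client would also close the timed-out socket so the request
>         // can be retried on another node.
>         String pickCandidate(List<String> cachedNodeIds, long nowMs) {
>             for (String nodeId : cachedNodeIds) {
>                 if (setupTimedOut(nodeId, nowMs))
>                     continue;
>                 return nodeId;
>             }
>             return null;
>         }
>     }
> 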
> I also updated the KIP content and KIP status. Please let me know if the 
> above ideas make sense. Thanks.
> 
> Best, - Cheng Tan
> 
> 
> 
>> On May 4, 2020, at 5:26 PM, Colin McCabe <cmcc...@apache.org> wrote:
>> 
>> Hi Cheng,
>> 
>> On the KIP page, it lists this KIP as "draft."  It seems like "under 
>> discussion" is appropriate here, right?
>> 
>>> Currently, the initial socket connection timeout depends on the Linux 
>>> kernel setting tcp_syn_retries. The timeout value is 2 ^ (tcp_syn_retries + 1) - 1 
>>> seconds (with the common Linux default of tcp_syn_retries = 6, that is 
>>> 2^7 - 1 = 127 seconds). For the reasons below, we want to control the 
>>> client-side socket timeout directly using configuration files.
>> 
>> Linux is just one example of an OS that Kafka could run on, right?  You 
>> could also be running on MacOS, for example.
>> 
>>> I'm proposing to do a lazy socket connection timeout. That is, we only 
>>> check whether we need to time out a socket when we consider the 
>>> corresponding node as a candidate in the node provider.
>> 
>> The NodeProvider is an AdminClient abstraction, right?  Why wouldn't we 
>> implement a connection setup timeout for all clients, not just AdminClient?
>> 
>> best,
>> Colin
>> 
>> On Mon, May 4, 2020, at 13:18, Colin McCabe wrote:
>>> Hmm.  A big part of the reason behind the KIP is that the default 
>>> connection timeout behavior of the OS doesn't work for Kafka, right?  
>>> For example, on Linux, if we wait 127 seconds for a connection attempt 
>>> to time out, we won't get a chance to make another attempt in most 
>>> cases.  So I think it makes sense to set a shorter default.
>>> 
>>> best,
>>> Colin
>>> 
>>> 
>>> On Mon, May 4, 2020, at 09:44, Jose Garcia Sancio wrote:
>>>> Thanks for the KIP Cheng,
>>>> 
>>>>> The default value will be 10 seconds.
>>>> 
>>>> I think we should make the default the current behavior, meaning the
>>>> default should leverage the default connect timeout from the operating
>>>> system.
>>>> 
>>>>> Proposed Changes
>>>> 
>>>> I don't fully understand this section. It seems like it is mainly
>>>> focused on the problem with the current implementation. Can you
>>>> explain how the proposed changes solve the problem?
>>>> 
>>>> Thanks.
>>>> 
>>>> 
>>>> -- 
>>>> -Jose
>>>> 
>>> 
> 
