[ 
https://issues.apache.org/jira/browse/KAFKA-19919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18088881#comment-18088881
 ] 

Yunseop Eom edited comment on KAFKA-19919 at 6/14/26 1:13 PM:
--------------------------------------------------------------

[~shlomit]

Hi Shlomit, thanks for checking.

I opened a PR for this issue:

https://github.com/apache/kafka/pull/22569

The current commit is:

https://github.com/apache/kafka/pull/22569/commits/78fd8e5d60790b300f10bf2ea8d18a63b48a71f4

The fix keeps the scope narrow: built-in non-GSSAPI server mechanisms use an 
unbound serverName, while custom non-GSSAPI mechanisms, the GSSAPI/Kerberos 
path, and client-side SASL behavior preserve the existing behavior.
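
As a rough illustration of that scoping rule (this is not the code from the PR): the class name, the method name, and the exact set of built-in mechanisms below are assumptions made for this sketch. A null serverName is what javax.security.sasl.Sasl.createSaslServer documents as a server not bound to any specific host name.

```java
import java.util.Set;
import java.util.function.Supplier;

public class ServerNameSketch {
    // Assumed list of built-in non-GSSAPI mechanisms; illustrative only.
    static final Set<String> BUILT_IN_NON_GSSAPI =
        Set.of("PLAIN", "SCRAM-SHA-256", "SCRAM-SHA-512", "OAUTHBEARER");

    // Returns null (unbound) for built-in non-GSSAPI mechanisms, so the
    // potentially blocking host name lookup is never invoked for them.
    // Any other mechanism keeps the name the caller would have used before.
    static String serverNameFor(String mechanism, Supplier<String> hostNameLookup) {
        if (BUILT_IN_NON_GSSAPI.contains(mechanism)) {
            return null;
        }
        return hostNameLookup.get();
    }

    public static void main(String[] args) {
        // Hypothetical host name standing in for the real lookup.
        Supplier<String> lookup = () -> "broker1.example.com";
        System.out.println(serverNameFor("PLAIN", lookup));   // prints "null"
        System.out.println(serverNameFor("GSSAPI", lookup));  // prints "broker1.example.com"
    }
}
```

Passing the lookup as a Supplier keeps it lazy, so the reverse DNS query only happens on paths that still need a bound name.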

I also added focused coverage for built-in non-GSSAPI success/failure/exception 
paths, multiple enabled mechanisms, custom mechanism compatibility, 
GSSAPI/Kerberos behavior, unsupported mechanisms, and direct factory creation 
with a null serverName.

[~ijuma] [~nurzh4n]

Maintainers, does this direction sound reasonable, or would you prefer a 
different replacement for serverName?


was (Author: JIRAUSER313210):
[~shlomit] 

Hi Shlomit, thanks for checking.

Yes, I’m still interested in working on this. I haven’t opened a PR yet because 
I wanted to get maintainer feedback on the compatibility question around the 
`serverName` argument before changing behavior.

My current plan is still to keep the fix narrowly scoped to the broker-side 
non-GSSAPI SASL server path, while leaving GSSAPI/Kerberos and client-side SASL 
behavior unchanged.

[~ijuma] [~nurzh4n]

Maintainers, does this direction sound reasonable, or would you prefer a 
different replacement for `serverName`?

> Network Threads Blocked by Synchronous Reverse DNS Lookups During Connection 
> Establishment
> ------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-19919
>                 URL: https://issues.apache.org/jira/browse/KAFKA-19919
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, core, network, security
>    Affects Versions: 3.6.1
>            Reporter: Nurzhan
>            Priority: Major
>
> We had an issue with network threads described in 
> [https://lists.apache.org/thread/hr26jkgsg243s8oyy3gq5y84vv9stodv].
> In short, the problem was with intermittent low network thread idle percent 
> due to high response send time, resulting in high response queue size.
> After some cycles of debugging and deploying the modified Kafka, we found out 
> that the problem was in blocking DNS queries during the preparation of new 
> connections.
> [https://github.com/apache/kafka/blob/be816b82d25370ceac697ccf7c88cea873e9b4e3/clients/src/main/java/org/apache/kafka/common/network/Selector.java#L548]
> [https://github.com/apache/kafka/blob/be816b82d25370ceac697ccf7c88cea873e9b4e3/clients/src/main/java/org/apache/kafka/common/network/KafkaChannel.java#L174]
> [https://github.com/apache/kafka/blob/be816b82d25370ceac697ccf7c88cea873e9b4e3/clients/src/main/java/org/apache/kafka/common/security/authenticator/SaslServerAuthenticator.java#L209]
> Here, the KafkaChannel.prepare() call eventually results in calling 
> serverAddress().getHostName(), which does the reverse DNS lookup on the 
> network thread, blocking all the other connections assigned to the same 
> network thread. The DNS query is pretty simple; it just resolves the broker 
> hostname from IP, so we solved the problem of slow response send time by 
> adding the DNS record in /etc/hosts. Though I'm not sure why there's such a 
> query at all, adding the fixed record seems to fix the problem in our fairly 
> static environment (I cannot guarantee that it will work in other 
> environments). In our case, the network thread was blocked for 5 seconds due 
> to the default DNS query timeout in Linux.
> We are writing this issue for those who come after because we couldn't find 
> similar problems with Kafka on the web.
> In addition, Kafka developers may want to consider some proposals:
> - Adding networkThreadTimeNanos in 
> [https://github.com/apache/kafka/blob/be816b82d25370ceac697ccf7c88cea873e9b4e3/core/src/main/scala/kafka/network/RequestChannel.scala#L201C30-L201C52]
> to the debug logs for each request, because it was hard to pinpoint the 
> problem when the only supporting metric was response send time, which 
> included the time to handle all connections during one selector poll.
> - Adding networkThreadTimeNanos to Selector.SelectorMetrics 
> (https://github.com/apache/kafka/blob/be816b82d25370ceac697ccf7c88cea873e9b4e3/clients/src/main/java/org/apache/kafka/common/network/Selector.java#L1121).
> - Possibly caching the results of the DNS queries.
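
The blocking call described in the issue can be reproduced outside Kafka with plain JDK classes. The sketch below is only an illustration: the class and method names are made up, and 192.0.2.1 is a documentation-range address chosen as a placeholder. It contrasts InetSocketAddress.getHostString(), which does not attempt a reverse lookup, with getHostName(), which may perform one and block the calling thread until the resolver answers or times out.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;

public class ReverseLookupSketch {
    // Builds a socket address from raw bytes, so no host name is attached to it.
    static InetSocketAddress literalAddress() throws UnknownHostException {
        InetAddress ip = InetAddress.getByAddress(new byte[] {(byte) 192, 0, 2, 1});
        return new InetSocketAddress(ip, 9092);
    }

    // Times getHostName(), which may trigger a reverse DNS lookup.
    static long timedHostNameMillis(InetSocketAddress address) {
        long start = System.nanoTime();
        address.getHostName();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws UnknownHostException {
        InetSocketAddress address = literalAddress();

        // No reverse lookup: returns the literal IP because no name is known.
        System.out.println(address.getHostString()); // prints "192.0.2.1"

        // Only run the potentially slow lookup when explicitly requested.
        if (args.length > 0) {
            System.out.println("getHostName() took " + timedHostNameMillis(address) + " ms");
        }
    }
}
```

On a host with a slow or unreachable resolver, the timed call is where a delay like the one reported would show up; the duration depends entirely on the local DNS configuration.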



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
