[
https://issues.apache.org/jira/browse/KAFKA-19919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18088881#comment-18088881
]
Yunseop Eom edited comment on KAFKA-19919 at 6/14/26 1:13 PM:
--------------------------------------------------------------
[~shlomit]
Hi Shlomit, thanks for checking.
I opened a PR for this issue:
https://github.com/apache/kafka/pull/22569
The current commit is:
https://github.com/apache/kafka/pull/22569/commits/78fd8e5d60790b300f10bf2ea8d18a63b48a71f4
The fix keeps the scope narrow: built-in non-GSSAPI server mechanisms use an
unbound serverName, while custom non-GSSAPI mechanisms, the GSSAPI/Kerberos
path, and client-side SASL behavior preserve the existing behavior.
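To make the scope concrete, here is a minimal sketch of the selection rule. It is an illustration, not the code in the PR. The class name `ServerNameSketch`, the helper `serverNameFor`, and the `BUILT_IN_NON_GSSAPI` set are hypothetical names, and the mechanism list is an assumption taken from the SASL mechanisms Kafka documents. It relies on `javax.security.sasl.Sasl.createSaslServer` accepting a null serverName to mean "not bound to any specific host name".

```java
import java.net.InetAddress;
import java.util.Set;

public class ServerNameSketch {
    // Assumption: built-in non-GSSAPI mechanisms, per Kafka's documented SASL mechanisms.
    static final Set<String> BUILT_IN_NON_GSSAPI =
            Set.of("PLAIN", "SCRAM-SHA-256", "SCRAM-SHA-512", "OAUTHBEARER");

    // Hypothetical helper: chooses the serverName that would be handed to
    // javax.security.sasl.Sasl.createSaslServer(...), where null means "unbound".
    static String serverNameFor(String mechanism, InetAddress serverAddress) {
        if (BUILT_IN_NON_GSSAPI.contains(mechanism)) {
            // getHostName() is never called here, so no reverse DNS lookup
            // can run on the network thread for these mechanisms.
            return null;
        }
        // GSSAPI and custom mechanisms keep the existing behavior, which may
        // trigger a reverse lookup if the address has no host name attached.
        return serverAddress.getHostName();
    }

    public static void main(String[] args) {
        // The loopback address already carries a host name, so the demo needs no DNS.
        InetAddress address = InetAddress.getLoopbackAddress();
        System.out.println(serverNameFor("PLAIN", address));
        System.out.println(serverNameFor("GSSAPI", address));
    }
}
```

In this sketch only the first branch changes behavior; everything that does not match the built-in set falls through to the same `getHostName()` call the issue describes.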
I also added focused coverage for built-in non-GSSAPI success/failure/exception
paths, multiple enabled mechanisms, custom mechanism compatibility,
GSSAPI/Kerberos behavior, unsupported mechanisms, and direct factory creation
with a null serverName.
[~ijuma] [~nurzh4n]
Maintainers, does this direction sound reasonable, or would you prefer a
different replacement for serverName?
was (Author: JIRAUSER313210):
[~shlomit]
Hi Shlomit, thanks for checking.
Yes, I’m still interested in working on this. I haven’t opened a PR yet because
I wanted to get maintainer feedback on the compatibility question around the
`serverName` argument before changing behavior.
My current plan is still to keep the fix narrowly scoped to the broker-side
non-GSSAPI SASL server path, while leaving GSSAPI/Kerberos and client-side SASL
behavior unchanged.
[~ijuma] [~nurzh4n]
Maintainers, does this direction sound reasonable, or would you prefer a
different replacement for `serverName`?
> Network Threads Blocked by Synchronous Reverse DNS Lookups During Connection
> Establishment
> ------------------------------------------------------------------------------------------
>
> Key: KAFKA-19919
> URL: https://issues.apache.org/jira/browse/KAFKA-19919
> Project: Kafka
> Issue Type: Bug
> Components: clients, core, network, security
> Affects Versions: 3.6.1
> Reporter: Nurzhan
> Priority: Major
>
> We had an issue with network threads described in
> [https://lists.apache.org/thread/hr26jkgsg243s8oyy3gq5y84vv9stodv].
> In short, the network thread idle percentage was intermittently low
> because of high response send time, which resulted in a large response queue.
> After some cycles of debugging and deploying the modified Kafka, we found out
> that the problem was in blocking DNS queries during the preparation of new
> connections.
> [https://github.com/apache/kafka/blob/be816b82d25370ceac697ccf7c88cea873e9b4e3/clients/src/main/java/org/apache/kafka/common/network/Selector.java#L548]
> [https://github.com/apache/kafka/blob/be816b82d25370ceac697ccf7c88cea873e9b4e3/clients/src/main/java/org/apache/kafka/common/network/KafkaChannel.java#L174]
> [https://github.com/apache/kafka/blob/be816b82d25370ceac697ccf7c88cea873e9b4e3/clients/src/main/java/org/apache/kafka/common/security/authenticator/SaslServerAuthenticator.java#L209]
> Here, the KafkaChannel.prepare() call eventually results in calling
> serverAddress().getHostName(), which does the reverse DNS lookup on the
> network thread, blocking all the other connections assigned to the same
> network thread. The DNS query is pretty simple; it just resolves the broker
> hostname from IP, so we solved the problem of slow response send time by
> adding the DNS record in /etc/hosts. Though I'm not sure why there's such a
> query at all, adding the fixed record seems to fix the problem in our fairly
> static environment (I cannot guarantee that it will work in other
> environments). In our case, the network thread was blocked for 5 seconds due
> to the default DNS query timeout in Linux.
> We are writing this issue for those who come after because we couldn't find
> similar problems with Kafka on the web.
> In addition, Kafka developers may want to consider some proposals:
> * Adding networkThreadTimeNanos in
> [https://github.com/apache/kafka/blob/be816b82d25370ceac697ccf7c88cea873e9b4e3/core/src/main/scala/kafka/network/RequestChannel.scala#L201C30-L201C52]
> to the debug logs for each request, because it was hard to pinpoint the
> problem when the only supporting metric was response send time, which
> included the time to handle all connections during one selector poll.
> * Adding networkThreadTimeNanos to Selector.SelectorMetrics
> (https://github.com/apache/kafka/blob/be816b82d25370ceac697ccf7c88cea873e9b4e3/clients/src/main/java/org/apache/kafka/common/network/Selector.java#L1121).
> * Caching the DNS queries.