[
https://issues.apache.org/jira/browse/KAFKA-19919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18042391#comment-18042391
]
Ismael Juma commented on KAFKA-19919:
-------------------------------------
Looks related to KAFKA-8562.
> Network Threads Blocked by Synchronous Reverse DNS Lookups During Connection
> Establishment
> ------------------------------------------------------------------------------------------
>
> Key: KAFKA-19919
> URL: https://issues.apache.org/jira/browse/KAFKA-19919
> Project: Kafka
> Issue Type: Bug
> Components: clients, core, network, security
> Affects Versions: 3.6.1
> Reporter: Nurzhan
> Priority: Major
>
> We had an issue with network threads, described in
> [https://lists.apache.org/thread/hr26jkgsg243s8oyy3gq5y84vv9stodv].
> In short, the network thread idle percentage dropped intermittently because
> of high response send time, which led to a large response queue.
> After several rounds of debugging and deploying a modified Kafka build, we
> found that the cause was blocking DNS queries during the preparation of new
> connections.
> [https://github.com/apache/kafka/blob/be816b82d25370ceac697ccf7c88cea873e9b4e3/clients/src/main/java/org/apache/kafka/common/network/Selector.java#L548]
> [https://github.com/apache/kafka/blob/be816b82d25370ceac697ccf7c88cea873e9b4e3/clients/src/main/java/org/apache/kafka/common/network/KafkaChannel.java#L174]
> [https://github.com/apache/kafka/blob/be816b82d25370ceac697ccf7c88cea873e9b4e3/clients/src/main/java/org/apache/kafka/common/security/authenticator/SaslServerAuthenticator.java#L209]
> Here, the KafkaChannel.prepare() call eventually results in calling
> serverAddress().getHostName(), which does the reverse DNS lookup on the
> network thread, blocking all the other connections assigned to the same
> network thread. The DNS query is pretty simple; it just resolves the broker
> hostname from IP, so we solved the problem of slow response send time by
> adding the DNS record in /etc/hosts. Though I'm not sure why there's such a
> query at all, adding the fixed record seems to fix the problem in our fairly
> static environment (I cannot guarantee that it will work in other
> environments). In our case, the network thread was blocked for 5 seconds due
> to the default DNS query timeout in Linux.
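To illustrate the behaviour described above: in the JDK, `InetAddress.getHostName()` performs a reverse lookup when the address was not created with a host name, while `InetSocketAddress.getHostString()` returns the stored name or the literal IP without a lookup. The following is a minimal standalone sketch, not Kafka code; the class `ReverseDnsSketch`, its method names, and the port 9092 are hypothetical and chosen only for illustration.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;

// Hypothetical demo class; not part of Kafka.
final class ReverseDnsSketch {

    // Returns the textual form of the address without consulting the resolver.
    static String literalHost(byte[] rawIp, int port) {
        try {
            // getByAddress only validates the length; it performs no lookup.
            InetAddress addr = InetAddress.getByAddress(rawIp);
            // getHostString does not attempt a reverse lookup.
            return new InetSocketAddress(addr, port).getHostString();
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("raw IP must be 4 or 16 bytes", e);
        }
    }

    // Measures how long getHostName() takes on the calling thread.
    // If the address has no cached host name, this may block on a reverse lookup.
    static long timeReverseLookupNanos(InetAddress addr) {
        long start = System.nanoTime();
        addr.getHostName();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws UnknownHostException {
        byte[] ip = {127, 0, 0, 1};
        System.out.println("literal host: " + literalHost(ip, 9092));
        long nanos = timeReverseLookupNanos(InetAddress.getByAddress(ip));
        System.out.println("getHostName() took " + (nanos / 1_000_000) + " ms");
    }
}
```

Running the timing method against an address whose reverse record is slow to resolve should show the delay landing on whichever thread makes the call, which is the effect reported here for the network thread.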
> We are writing this issue for those who come after because we couldn't find
> similar problems with Kafka on the web.
> In addition, Kafka developers may want to consider some proposals:
> * Add networkThreadTimeNanos in
> [https://github.com/apache/kafka/blob/be816b82d25370ceac697ccf7c88cea873e9b4e3/core/src/main/scala/kafka/network/RequestChannel.scala#L201C30-L201C52]
> to the debug logs for each request, because it was hard to pinpoint the
> problem when the only supporting metric was response send time, which
> included the time to handle all connections during one selector poll.
> * Add networkThreadTimeNanos to Selector.SelectorMetrics
> (https://github.com/apache/kafka/blob/be816b82d25370ceac697ccf7c88cea873e9b4e3/clients/src/main/java/org/apache/kafka/common/network/Selector.java#L1121).
> * Consider caching the DNS queries.
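The caching proposal could be sketched roughly as follows. This is only an assumption about what such a cache might look like, not an existing Kafka class: `CachingReverseResolver`, its TTL parameter, and the injected lookup function are all hypothetical. A cache miss would still block the calling thread for the duration of the lookup, so this would reduce how often the stall happens rather than remove it.

```java
import java.net.InetAddress;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Hypothetical TTL cache in front of a reverse lookup; not part of Kafka.
final class CachingReverseResolver {

    private static final class Entry {
        final String host;
        final long expiresAtNanos;

        Entry(String host, long expiresAtNanos) {
            this.host = host;
            this.expiresAtNanos = expiresAtNanos;
        }
    }

    private final ConcurrentHashMap<InetAddress, Entry> cache = new ConcurrentHashMap<>();
    private final Function<InetAddress, String> lookup;
    private final long ttlNanos;

    CachingReverseResolver(Function<InetAddress, String> lookup, long ttlNanos) {
        this.lookup = lookup;
        this.ttlNanos = ttlNanos;
    }

    // Returns the cached host name, running the lookup only on a miss or after expiry.
    // Concurrent callers may occasionally both run the lookup; the last result wins.
    String hostName(InetAddress addr) {
        long now = System.nanoTime();
        Entry entry = cache.get(addr);
        if (entry == null || now - entry.expiresAtNanos >= 0) {
            entry = new Entry(lookup.apply(addr), now + ttlNanos);
            cache.put(addr, entry);
        }
        return entry.host;
    }

    public static void main(String[] args) {
        CachingReverseResolver resolver =
            new CachingReverseResolver(InetAddress::getHostName, TimeUnit.SECONDS.toNanos(30));
        InetAddress addr = InetAddress.getLoopbackAddress();
        System.out.println(resolver.hostName(addr)); // may invoke the lookup
        System.out.println(resolver.hostName(addr)); // served from the cache
    }
}
```

Injecting the lookup as a `Function` keeps the sketch testable without touching a real resolver; the 30-second TTL in `main` is an arbitrary illustrative value.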