[ 
https://issues.apache.org/jira/browse/KAFKA-19919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18042372#comment-18042372
 ] 

PoAn Yang commented on KAFKA-19919:
-----------------------------------

[~nurzh4n] Thank you for reporting this issue. The `Fix Version/s` field is 
meant for versions where the PR has been merged, so I've removed the existing 
tags for now. We can certainly add them back once a PR is available.

> Network Threads Blocked by Synchronous Reverse DNS Lookups During Connection 
> Establishment
> ------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-19919
>                 URL: https://issues.apache.org/jira/browse/KAFKA-19919
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, core, network, security
>    Affects Versions: 3.6.1
>            Reporter: Nurzhan
>            Priority: Major
>
> We had an issue with network threads, described in 
> [https://lists.apache.org/thread/hr26jkgsg243s8oyy3gq5y84vv9stodv].
> In short, the network thread idle percentage was intermittently low because 
> of high response send time, which resulted in a large response queue size.
> After some cycles of debugging and deploying the modified Kafka, we found out 
> that the problem was in blocking DNS queries during the preparation of new 
> connections.
> [https://github.com/apache/kafka/blob/be816b82d25370ceac697ccf7c88cea873e9b4e3/clients/src/main/java/org/apache/kafka/common/network/Selector.java#L548]
> [https://github.com/apache/kafka/blob/be816b82d25370ceac697ccf7c88cea873e9b4e3/clients/src/main/java/org/apache/kafka/common/network/KafkaChannel.java#L174]
> [https://github.com/apache/kafka/blob/be816b82d25370ceac697ccf7c88cea873e9b4e3/clients/src/main/java/org/apache/kafka/common/security/authenticator/SaslServerAuthenticator.java#L209]
> Here, the KafkaChannel.prepare() call eventually results in calling 
> serverAddress().getHostName(), which does the reverse DNS lookup on the 
> network thread, blocking all the other connections assigned to the same 
> network thread. The DNS query is pretty simple; it just resolves the broker 
> hostname from IP, so we solved the problem of slow response send time by 
> adding the DNS record in /etc/hosts. Though I'm not sure why there's such a 
> query at all, adding the fixed record seems to fix the problem in our fairly 
> static environment (I cannot guarantee that it will work in other 
> environments). In our case, the network thread was blocked for 5 seconds due 
> to the default DNS query timeout in Linux.
> We are writing this issue for those who come after because we couldn't find 
> similar problems with Kafka on the web.
> In addition, Kafka developers may want to consider the following proposals:
> * Add networkThreadTimeNanos in 
> [https://github.com/apache/kafka/blob/be816b82d25370ceac697ccf7c88cea873e9b4e3/core/src/main/scala/kafka/network/RequestChannel.scala#L201C30-L201C52]
> to the debug logs for each request, because it was hard to pinpoint the 
> problem when the only supporting metric was response send time, which 
> includes the time to handle all connections during one selector poll.
> * Add networkThreadTimeNanos to Selector.SelectorMetrics 
> (https://github.com/apache/kafka/blob/be816b82d25370ceac697ccf7c88cea873e9b4e3/clients/src/main/java/org/apache/kafka/common/network/Selector.java#L1121).
> * Possibly cache the DNS queries.
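
For illustration only, the last proposal in the description (caching the reverse DNS result) might look roughly like the sketch below. This is not Kafka code: `ReverseDnsCache`, `hostNameFor`, and `usingSystemResolver` are hypothetical names, and the TTL handling is an assumption. The only real API relied on is `InetAddress.getHostName()`, which is the call that can trigger the blocking reverse lookup.

```java
import java.net.InetAddress;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical helper (not part of Kafka): caches reverse DNS results with a TTL. */
final class ReverseDnsCache {
    /** A cached host name together with the time at which it expires. */
    private static final class Entry {
        final String hostName;
        final long expiresAtNanos;

        Entry(String hostName, long expiresAtNanos) {
            this.hostName = hostName;
            this.expiresAtNanos = expiresAtNanos;
        }
    }

    private final Map<InetAddress, Entry> cache = new ConcurrentHashMap<>();
    private final Function<InetAddress, String> resolver;
    private final long ttlNanos;
    private final LongSupplier clock;

    /** The resolver and clock are injectable so the cache can be tested without DNS. */
    ReverseDnsCache(Function<InetAddress, String> resolver, long ttlNanos, LongSupplier clock) {
        this.resolver = resolver;
        this.ttlNanos = ttlNanos;
        this.clock = clock;
    }

    /** Default wiring: InetAddress.getHostName() may perform a blocking reverse lookup. */
    static ReverseDnsCache usingSystemResolver(long ttlNanos) {
        return new ReverseDnsCache(InetAddress::getHostName, ttlNanos, System::nanoTime);
    }

    /** Returns the cached host name if still fresh; otherwise resolves and stores it. */
    String hostNameFor(InetAddress address) {
        long now = clock.getAsLong();
        Entry cached = cache.get(address);
        // Subtraction keeps the comparison correct even if nanoTime wraps around.
        if (cached != null && now - cached.expiresAtNanos < 0) {
            return cached.hostName;
        }
        String resolved = resolver.apply(address); // the potentially slow call
        cache.put(address, new Entry(resolved, now + ttlNanos));
        return resolved;
    }
}
```

Note that a cache like this only removes repeated lookups. The first lookup for an address, and every lookup after the entry expires, would still block the calling thread, so it would probably need to be combined with resolving off the network thread or with a static record such as the /etc/hosts entry mentioned above.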



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
