Nurzhan created KAFKA-19919:
-------------------------------

             Summary: Network Threads Blocked by Synchronous Reverse DNS 
Lookups During Connection Establishment
                 Key: KAFKA-19919
                 URL: https://issues.apache.org/jira/browse/KAFKA-19919
             Project: Kafka
          Issue Type: Bug
          Components: clients, core, network, security
    Affects Versions: 3.6.1
            Reporter: Nurzhan
             Fix For: 3.8.2, 3.9.2, 4.1.0, 4.0.1, 4.0.0, 3.9.1, 3.9.0, 3.8.1, 
3.8.0, 3.7.2, 3.7.1, 3.7.0, 3.6.2, 3.6.1


We ran into the network thread problem described in 
[https://lists.apache.org/thread/hr26jkgsg243s8oyy3gq5y84vv9stodv].

In short, the network thread idle percent dropped intermittently because of 
high response send time, which also led to a large response queue size.

After several rounds of debugging and deploying modified Kafka builds, we found 
that the cause was blocking DNS queries during the preparation of new 
connections.

[https://github.com/apache/kafka/blob/be816b82d25370ceac697ccf7c88cea873e9b4e3/clients/src/main/java/org/apache/kafka/common/network/Selector.java#L548]

[https://github.com/apache/kafka/blob/be816b82d25370ceac697ccf7c88cea873e9b4e3/clients/src/main/java/org/apache/kafka/common/network/KafkaChannel.java#L174]

[https://github.com/apache/kafka/blob/be816b82d25370ceac697ccf7c88cea873e9b4e3/clients/src/main/java/org/apache/kafka/common/security/authenticator/SaslServerAuthenticator.java#L209]

Here, the KafkaChannel.prepare() call eventually calls 
serverAddress().getHostName(), which performs a reverse DNS lookup on the 
network thread and blocks every other connection assigned to that thread. The 
query itself is simple: it resolves the broker hostname from its IP address. In 
our case, each lookup blocked the network thread for 5 seconds, the default DNS 
query timeout in Linux. We solved the slow response send time by adding a 
record for the broker to /etc/hosts. I'm not sure why this query is made at 
all, and although the fixed record works in our fairly static environment, I 
cannot guarantee that it will work in other environments.
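
To illustrate the behaviour described above, here is a minimal Java sketch of 
the difference between InetAddress.getHostName(), which may perform a reverse 
lookup, and InetSocketAddress.getHostString(), which does not. The class and 
method names (ReverseDnsDemo, fromLiteral, nameWithoutLookup, 
timedReverseLookupNanos) and the address 192.0.2.10:9092 are made up for this 
example and are not Kafka code. It is not a proposed patch either: the 
authenticator may genuinely need the resolved host name, for example for 
Kerberos.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;

// Hypothetical demo class, not part of Kafka.
final class ReverseDnsDemo {

    // Builds a socket address from a raw IPv4 literal, so no host name is attached to it.
    static InetSocketAddress fromLiteral(int a, int b, int c, int d, int port) {
        try {
            InetAddress ip = InetAddress.getByAddress(
                    new byte[] {(byte) a, (byte) b, (byte) c, (byte) d});
            return new InetSocketAddress(ip, port);
        } catch (UnknownHostException e) {
            // getByAddress only throws for an illegal byte-array length.
            throw new IllegalArgumentException(e);
        }
    }

    // Does not query the resolver: returns the stored host name, or the IP literal if none.
    static String nameWithoutLookup(InetSocketAddress address) {
        return address.getHostString();
    }

    // May block: getHostName() performs a reverse lookup when no host name is stored.
    static long timedReverseLookupNanos(InetSocketAddress address) {
        long start = System.nanoTime();
        address.getAddress().getHostName();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        // 192.0.2.10 is a documentation address used as a stand-in for a broker IP.
        InetSocketAddress broker = fromLiteral(192, 0, 2, 10, 9092);
        System.out.println("getHostString(): " + nameWithoutLookup(broker));
        long millis = timedReverseLookupNanos(broker) / 1_000_000;
        System.out.println("getHostName() took " + millis + " ms");
    }
}
```

How long the second call takes depends entirely on the local resolver 
configuration; with a record for the address in /etc/hosts it should return 
without a network round trip.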

We are writing this issue for those who come after because we couldn't find 
similar problems with Kafka on the web.

In addition, Kafka developers may want to consider the following proposals:

* Add networkThreadTimeNanos (see 
[https://github.com/apache/kafka/blob/be816b82d25370ceac697ccf7c88cea873e9b4e3/core/src/main/scala/kafka/network/RequestChannel.scala#L201C30-L201C52])
 to the debug logs for each request. It was hard to pinpoint the problem when 
the only supporting metric was response send time, which includes the time to 
handle all connections during one selector poll.
* Add networkThreadTimeNanos to Selector.SelectorMetrics 
(https://github.com/apache/kafka/blob/be816b82d25370ceac697ccf7c88cea873e9b4e3/clients/src/main/java/org/apache/kafka/common/network/Selector.java#L1121).
* Consider caching the DNS queries.
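
As a rough illustration of the caching idea, the sketch below keeps reverse 
lookup results in a small TTL cache. Everything in it (the ReverseDnsCache 
class, hostNameFor, the injected resolver and clock) is hypothetical and not 
an existing Kafka API. A cache miss still blocks the calling thread, so this 
would only reduce how often a network thread stalls, not remove the blocking 
call.

```java
import java.net.InetAddress;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Hypothetical TTL cache for reverse DNS results, not part of Kafka.
final class ReverseDnsCache {

    private static final class Entry {
        final String hostName;
        final long expiresAtNanos;

        Entry(String hostName, long expiresAtNanos) {
            this.hostName = hostName;
            this.expiresAtNanos = expiresAtNanos;
        }
    }

    private final ConcurrentHashMap<InetAddress, Entry> cache = new ConcurrentHashMap<>();
    private final Function<InetAddress, String> resolver;
    private final long ttlNanos;
    private final LongSupplier nanoClock;

    // The resolver and clock are injected so the cache can be exercised without real DNS.
    ReverseDnsCache(Function<InetAddress, String> resolver, long ttlNanos, LongSupplier nanoClock) {
        this.resolver = resolver;
        this.ttlNanos = ttlNanos;
        this.nanoClock = nanoClock;
    }

    // Convenience factory that uses the JDK's (potentially blocking) reverse lookup.
    static ReverseDnsCache usingSystemResolver(long ttlNanos) {
        return new ReverseDnsCache(InetAddress::getHostName, ttlNanos, System::nanoTime);
    }

    // Returns the cached host name if it has not expired, otherwise resolves and stores it.
    String hostNameFor(InetAddress address) {
        long now = nanoClock.getAsLong();
        Entry cached = cache.get(address);
        if (cached != null && now - cached.expiresAtNanos < 0) {
            return cached.hostName;
        }
        String resolved = resolver.apply(address); // may block on a miss or after expiry
        cache.put(address, new Entry(resolved, now + ttlNanos));
        return resolved;
    }
}
```

Two threads that miss at the same time may both run the resolver; the sketch 
accepts that duplication rather than holding a lock during a blocking lookup.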



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
