Nurzhan created KAFKA-19919:
-------------------------------
Summary: Network Threads Blocked by Synchronous Reverse DNS
Lookups During Connection Establishment
Key: KAFKA-19919
URL: https://issues.apache.org/jira/browse/KAFKA-19919
Project: Kafka
Issue Type: Bug
Components: clients, core, network, security
Affects Versions: 3.6.1
Reporter: Nurzhan
Fix For: 3.8.2, 3.9.2, 4.1.0, 4.0.1, 4.0.0, 3.9.1, 3.9.0, 3.8.1,
3.8.0, 3.7.2, 3.7.1, 3.7.0, 3.6.2, 3.6.1
We had an issue with network threads, described in
[https://lists.apache.org/thread/hr26jkgsg243s8oyy3gq5y84vv9stodv].
In short, the network thread idle percent was intermittently low due to high
response send time, which resulted in a large response queue size.
After several rounds of debugging and deploying modified Kafka builds, we found
that the cause was blocking DNS queries made while preparing new connections.
[https://github.com/apache/kafka/blob/be816b82d25370ceac697ccf7c88cea873e9b4e3/clients/src/main/java/org/apache/kafka/common/network/Selector.java#L548]
[https://github.com/apache/kafka/blob/be816b82d25370ceac697ccf7c88cea873e9b4e3/clients/src/main/java/org/apache/kafka/common/network/KafkaChannel.java#L174]
[https://github.com/apache/kafka/blob/be816b82d25370ceac697ccf7c88cea873e9b4e3/clients/src/main/java/org/apache/kafka/common/security/authenticator/SaslServerAuthenticator.java#L209]
Here, the KafkaChannel.prepare() call eventually calls
serverAddress().getHostName(), which performs a reverse DNS lookup on the
network thread and blocks all other connections assigned to that thread.
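To illustrate the kind of call involved, here is a minimal standalone sketch (not Kafka code). It assumes a hypothetical broker IP 10.0.0.1 and hostname broker-1.example; the class and method names are made up for the example. It relies only on the documented behaviour of java.net.InetAddress: an address built from raw bytes has no stored hostname, so getHostName() has to ask the resolver, while an address built with an explicit name returns that name without a lookup.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

final class ReverseLookupDemo {

    /** Builds an InetAddress from raw bytes, optionally with a pre-set hostname (hypothetical helper). */
    static InetAddress address(String hostName, byte[] raw) {
        try {
            return hostName == null
                    ? InetAddress.getByAddress(raw)            // no hostname stored
                    : InetAddress.getByAddress(hostName, raw); // hostname stored, no lookup needed
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("raw address must be 4 or 16 bytes", e);
        }
    }

    /** Calls getHostName() and returns the elapsed wall-clock time in nanoseconds. */
    static long timeGetHostName(InetAddress address) {
        long start = System.nanoTime();
        address.getHostName(); // may block on a reverse DNS query if no hostname is stored
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        byte[] raw = {10, 0, 0, 1}; // hypothetical broker IP
        InetAddress unresolved = address(null, raw);
        InetAddress named = address("broker-1.example", raw);

        // getHostAddress() only formats the bytes; it never touches the resolver.
        System.out.println("literal IP: " + unresolved.getHostAddress());
        System.out.printf("getHostName() with stored name: %d ns%n", timeGetHostName(named));
        System.out.printf("getHostName() without stored name: %d ns%n", timeGetHostName(unresolved));
    }
}
```

On a host where the reverse query times out, the last line should take roughly as long as the resolver timeout, and the calling thread can do nothing else in the meantime.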
The DNS query is simple: it just resolves the broker hostname from its IP
address, so we fixed the slow response send time by adding the record to
/etc/hosts. I'm not sure why this query is needed at all, but adding the
fixed record seems to solve the problem in our fairly static environment (I
cannot guarantee that it will work in other environments). In our case, the
network thread was blocked for 5 seconds, the default DNS query timeout
on Linux.
We are writing this issue for those who come after because we couldn't find
similar problems with Kafka on the web.
In addition, Kafka developers may want to consider some proposals:
* Adding networkThreadTimeNanos in
[https://github.com/apache/kafka/blob/be816b82d25370ceac697ccf7c88cea873e9b4e3/core/src/main/scala/kafka/network/RequestChannel.scala#L201C30-L201C52]
to the debug logs for each request. It was hard to pinpoint the problem when
the only supporting metric was response send time, which included the time to
handle all connections during one selector poll.
* Adding networkThreadTimeNanos to Selector.SelectorMetrics
(https://github.com/apache/kafka/blob/be816b82d25370ceac697ccf7c88cea873e9b4e3/clients/src/main/java/org/apache/kafka/common/network/Selector.java#L1121).
* Possibly caching the DNS queries.
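As a rough illustration of the caching proposal, here is a minimal sketch of a TTL cache for reverse lookups. It is not existing Kafka code: ReverseDnsCache, its constructor parameters, and the injected resolver and clock are all hypothetical names chosen for the example. In real use the resolver could be InetAddress::getHostName and the clock System::nanoTime.

```java
import java.net.InetAddress;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical TTL cache for reverse DNS results, keyed by IP address. */
final class ReverseDnsCache {

    private static final class Entry {
        final String hostName;
        final long expiresAtNanos;

        Entry(String hostName, long expiresAtNanos) {
            this.hostName = hostName;
            this.expiresAtNanos = expiresAtNanos;
        }
    }

    private final ConcurrentHashMap<InetAddress, Entry> entries = new ConcurrentHashMap<>();
    private final Function<InetAddress, String> resolver; // e.g. InetAddress::getHostName
    private final long ttlNanos;
    private final LongSupplier clock;                      // e.g. System::nanoTime

    ReverseDnsCache(Function<InetAddress, String> resolver, long ttlNanos, LongSupplier clock) {
        this.resolver = resolver;
        this.ttlNanos = ttlNanos;
        this.clock = clock;
    }

    /** Returns the cached hostname if it is still fresh; otherwise resolves and caches it. */
    String hostName(InetAddress address) {
        long now = clock.getAsLong();
        Entry cached = entries.get(address);
        if (cached != null && now - cached.expiresAtNanos < 0) {
            return cached.hostName;
        }
        String resolved = resolver.apply(address); // the only place that may block
        entries.put(address, new Entry(resolved, now + ttlNanos));
        return resolved;
    }
}
```

Note that this only avoids repeated lookups: the first lookup for an address, and every lookup after the entry expires, still blocks the calling thread. Concurrent misses for the same address are also not coalesced. Moving the lookup off the network thread would be a separate change.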