[ https://issues.apache.org/jira/browse/KAFKA-9261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Manikumar reopened KAFKA-9261:
------------------------------
> NPE when updating client metadata
> ---------------------------------
>
> Key: KAFKA-9261
> URL: https://issues.apache.org/jira/browse/KAFKA-9261
> Project: Kafka
> Issue Type: Bug
> Reporter: Jason Gustafson
> Assignee: Jason Gustafson
> Priority: Major
> Fix For: 2.4.0, 2.3.2
>
>
> We have seen the following exception recently:
> {code}
> java.lang.NullPointerException
> at java.base/java.util.Objects.requireNonNull(Objects.java:221)
> at org.apache.kafka.common.Cluster.<init>(Cluster.java:134)
> at org.apache.kafka.common.Cluster.<init>(Cluster.java:89)
> at org.apache.kafka.clients.MetadataCache.computeClusterView(MetadataCache.java:120)
> at org.apache.kafka.clients.MetadataCache.<init>(MetadataCache.java:82)
> at org.apache.kafka.clients.MetadataCache.<init>(MetadataCache.java:58)
> at org.apache.kafka.clients.Metadata.handleMetadataResponse(Metadata.java:325)
> at org.apache.kafka.clients.Metadata.update(Metadata.java:252)
> at org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.handleCompletedMetadataResponse(NetworkClient.java:1059)
> at org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:845)
> at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:548)
> at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:262)
> at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:233)
> at org.apache.kafka.clients.consumer.KafkaConsumer.pollForFetches(KafkaConsumer.java:1281)
> at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1225)
> at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1201)
> {code}
> The client assumes that if a leader is included in the response, then node
> information must also be available. There are at least a couple of possible
> reasons this assumption can fail:
> 1. The client is able to detect stale partition metadata using leader epoch
> information available. If stale partition metadata is detected, the client
> ignores it and uses the last known metadata. However, it cannot detect stale
> broker information and will always accept the latest update. This means that
> the latest metadata may be a mix of multiple metadata responses and therefore
> the invariant will not generally hold.
> 2. On the broker, there is no lock which protects both the fetching of
> partition metadata and the fetching of the live brokers when handling a
> Metadata request. This means an UpdateMetadata request can arrive
> concurrently and break the intended invariant.
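
Case 1 can be illustrated with a small model. This is only a sketch under stated assumptions: `MixedMetadataExample`, `PartitionState`, `Response`, and `Cache` are hypothetical names invented for illustration and are not Kafka's classes. The cache keeps partition state by leader epoch but always replaces the broker list, so a stale response can leave the cached leader outside the live broker set.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical model for illustration only; these are not Kafka's classes.
final class MixedMetadataExample {

    /** Leader and leader epoch for a single partition. */
    static final class PartitionState {
        final int leaderId;
        final int leaderEpoch;

        PartitionState(int leaderId, int leaderEpoch) {
            this.leaderId = leaderId;
            this.leaderEpoch = leaderEpoch;
        }
    }

    /** One metadata response: partition state plus the live broker ids. */
    static final class Response {
        final PartitionState partition;
        final Set<Integer> liveBrokerIds;

        Response(PartitionState partition, Set<Integer> liveBrokerIds) {
            this.partition = partition;
            this.liveBrokerIds = liveBrokerIds;
        }
    }

    /** Client-side view that mixes data from successive responses. */
    static final class Cache {
        PartitionState partition;
        Set<Integer> liveBrokerIds = new HashSet<>();

        void update(Response response) {
            // Partition metadata with an older leader epoch is ignored.
            if (partition == null
                    || response.partition.leaderEpoch >= partition.leaderEpoch) {
                partition = response.partition;
            }
            // Broker information has no epoch, so the latest list always wins.
            liveBrokerIds = response.liveBrokerIds;
        }

        boolean leaderIsLive() {
            return partition != null && liveBrokerIds.contains(partition.leaderId);
        }
    }

    public static void main(String[] args) {
        Cache cache = new Cache();
        // Fresh response: leader 2 at epoch 5, brokers 1 and 2 are live.
        cache.update(new Response(new PartitionState(2, 5),
                new HashSet<>(Arrays.asList(1, 2))));
        System.out.println("leader live after first update: " + cache.leaderIsLive());
        // Stale response: epoch 4 is ignored, but its broker list is accepted.
        cache.update(new Response(new PartitionState(1, 4),
                new HashSet<>(Arrays.asList(1, 3))));
        System.out.println("leader live after stale update: " + cache.leaderIsLive());
    }
}
```

After the second update the cache still names broker 2 as leader, yet broker 2 is no longer in the live set, which is the broken invariant described in case 1.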
>
> It seems case 2 has been possible for a long time, but it should be extremely
> rare. Case 1 was only made possible with KIP-320, which added the leader
> epoch tracking. It should also be rare, but the window for inconsistent
> metadata is probably a bit bigger than the window for a concurrent update.
>
> To fix this, we should make the client more defensive about metadata updates
> and not assume that the leader is among the live endpoints.
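
One way to picture the defensive handling proposed above is a lookup that treats a leader missing from the live brokers as "no leader known" instead of passing a null node onward. This is a minimal sketch, not the actual patch: `DefensiveLeaderLookup`, its `Node` class, and `resolveLeader` are hypothetical names introduced for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of a defensive leader lookup; not Kafka's client code.
final class DefensiveLeaderLookup {

    /** Minimal stand-in for a broker endpoint. */
    static final class Node {
        final int id;
        final String host;
        final int port;

        Node(int id, String host, int port) {
            this.id = id;
            this.host = host;
            this.port = port;
        }
    }

    /**
     * Resolves a partition leader against the live brokers.
     * Returns an empty Optional when the leader id is absent, negative,
     * or not present in the live broker map, so callers never see null.
     */
    static Optional<Node> resolveLeader(Integer leaderId, Map<Integer, Node> liveBrokers) {
        if (leaderId == null || leaderId < 0) {
            return Optional.empty();
        }
        return Optional.ofNullable(liveBrokers.get(leaderId));
    }

    public static void main(String[] args) {
        Map<Integer, Node> live = new HashMap<>();
        live.put(1, new Node(1, "broker-1", 9092));
        // Leader 1 is live; leader 2 is named by partition metadata but not live.
        System.out.println("leader 1 known: " + resolveLeader(1, live).isPresent());
        System.out.println("leader 2 known: " + resolveLeader(2, live).isPresent());
    }
}
```

With this shape, a partition whose leader cannot be resolved is simply handled like a partition without a leader until a later metadata update restores consistency.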