[
https://issues.apache.org/jira/browse/KAFKA-10371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jason Gustafson resolved KAFKA-10371.
-------------------------------------
Resolution: Fixed
> Partition reassignments can result in crashed ReplicaFetcherThreads.
> --------------------------------------------------------------------
>
> Key: KAFKA-10371
> URL: https://issues.apache.org/jira/browse/KAFKA-10371
> Project: Kafka
> Issue Type: Bug
> Components: core
> Reporter: Steve Rodrigues
> Assignee: David Jacot
> Priority: Critical
>
> A Kafka cluster performing partition reassignments got stuck with the
> reassignment only partially complete, a non-zero number of under-replicated
> partitions (URPs), and steadily increasing max lag.
> Looking in the logs, we see:
> {noformat}
> [ERROR] 2020-07-31 21:22:23,984 [ReplicaFetcherThread-0-3] kafka.server.ReplicaFetcherThread - [ReplicaFetcher replicaId=4, leaderId=3, fetcherId=0] Error due to
> org.apache.kafka.common.errors.NotLeaderOrFollowerException: Error while fetching partition state for foo
> [INFO] 2020-07-31 21:22:23,986 [ReplicaFetcherThread-0-3] kafka.server.ReplicaFetcherThread - [ReplicaFetcher replicaId=4, leaderId=3, fetcherId=0] Stopped
> {noformat}
> Investigating further, after changing the exception so that it produces a
> stack trace (it did not by default because it is a client-side exception),
> we see the following on a test run:
> {noformat}
> [2020-08-06 19:58:21,592] ERROR [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Error due to (kafka.server.ReplicaFetcherThread)
> org.apache.kafka.common.errors.NotLeaderOrFollowerException: Error while fetching partition state for topic-test-topic-85
> at org.apache.kafka.common.protocol.Errors.exception(Errors.java:415)
> at kafka.server.ReplicaManager.getPartitionOrException(ReplicaManager.scala:645)
> at kafka.server.ReplicaManager.localLogOrException(ReplicaManager.scala:672)
> at kafka.server.ReplicaFetcherThread.logStartOffset(ReplicaFetcherThread.scala:133)
> at kafka.server.ReplicaFetcherThread.$anonfun$buildFetch$1(ReplicaFetcherThread.scala:316)
> at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:553)
> at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:551)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:920)
> at kafka.server.ReplicaFetcherThread.buildFetch(ReplicaFetcherThread.scala:309)
> {noformat}
> It appears that the fetcher is attempting to fetch for a partition that is
> being reassigned away. From further investigation, it seems that
> KAFKA-10002 reversed the order of the StopReplica steps, which used to be:
> 1. Remove partition from fetcher
> 2. Remove partition from partition map
> With the partition removed from the map first, the fetcher may race and
> attempt to build a fetch for a partition that is no longer mapped. In
> particular, localLogOrException is called from logStartOffset, which is
> protected only against KafkaStorageException and not against
> NotLeaderOrFollowerException, so the exception is not caught and propagates
> all the way out, killing the replica fetcher thread.
> We need to restore the original order.
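
The ordering problem described above can be illustrated with a small,
deterministic sketch. This is a toy model and not Kafka code: the class
StopReplicaOrdering, the two collections, UnmappedPartitionException (a
stand-in for NotLeaderOrFollowerException) and fetcherSurvives are all
hypothetical names chosen for illustration. The race is simulated by forcing
a buildFetch() call between the two removal steps, on the assumption that
this is the interleaving the report describes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only: none of these names are Kafka APIs.
class StopReplicaOrdering {
    // Stand-in for the broker's partition map (partition name -> log start offset).
    static final Map<String, Long> partitionMap = new HashMap<>();
    // Stand-in for the set of partitions currently owned by the fetcher thread.
    static final Set<String> fetcherPartitions = new HashSet<>();

    // Stand-in for NotLeaderOrFollowerException.
    static class UnmappedPartitionException extends RuntimeException {
        UnmappedPartitionException(String partition) {
            super("Error while fetching partition state for " + partition);
        }
    }

    // Mirrors the failing lookup: throws if the partition is no longer mapped.
    static long logStartOffset(String partition) {
        Long offset = partitionMap.get(partition);
        if (offset == null) {
            throw new UnmappedPartitionException(partition);
        }
        return offset;
    }

    // Builds one fetch entry per partition the fetcher still owns.
    static List<String> buildFetch() {
        List<String> entries = new ArrayList<>();
        for (String partition : fetcherPartitions) {
            entries.add(partition + "@" + logStartOffset(partition));
        }
        return entries;
    }

    // Runs the two removal steps in the given order, with buildFetch() forced
    // in between. Returns true if the simulated fetcher survives.
    static boolean fetcherSurvives(boolean removeFromFetcherFirst) {
        String partition = "topic-test-topic-85";
        partitionMap.clear();
        fetcherPartitions.clear();
        partitionMap.put(partition, 0L);
        fetcherPartitions.add(partition);

        // First removal step.
        if (removeFromFetcherFirst) {
            fetcherPartitions.remove(partition);
        } else {
            partitionMap.remove(partition);
        }

        // The fetcher runs between the two steps.
        boolean survived = true;
        try {
            buildFetch();
        } catch (UnmappedPartitionException e) {
            survived = false; // uncaught in the report, so the thread stops here
        }

        // Second removal step.
        if (removeFromFetcherFirst) {
            partitionMap.remove(partition);
        } else {
            fetcherPartitions.remove(partition);
        }
        return survived;
    }

    public static void main(String[] args) {
        System.out.println("fetcher first: survives=" + fetcherSurvives(true));
        System.out.println("map first:     survives=" + fetcherSurvives(false));
    }
}
```

In this model, removing the partition from the fetcher first means
buildFetch() never looks it up, so the interleaving is harmless. Removing it
from the map first leaves a window in which the fetcher still owns a
partition it can no longer resolve.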