rondagostino opened a new pull request #10003: URL: https://github.com/apache/kafka/pull/10003
Brokers receive metadata from the Raft metadata quorum very differently than they do from ZooKeeper today, and this has implications for `ReplicaManager`. In particular, when a broker reads the metadata log it may not arrive at the ultimate state for a partition until it has read multiple messages. In normal operation the multiple messages associated with a state change will all appear in a single batch, so they can and will be coalesced and applied together. However, there are circumstances where the messages associated with partition state changes will span multiple batches, and we will be forced to coalesce those batches together. This occurs in the following cases:

- When the broker restarts, it must "catch up" on the metadata log, and it is likely to see multiple state changes for a single partition across different batches while doing so. For example, it will see the `TopicRecord` and the `PartitionRecords` for the topic creation, and then any `IsrChangeRecords` recorded since the creation. The broker does not know the state of the topic partitions until it has read and coalesced all of these messages.
- The broker will also have to "catch up" on the metadata log if it becomes fenced and then regains its lease and resumes communication with the metadata quorum.
- A fenced broker may ultimately have to perform a "soft restart" if it was fenced for so long that the point at which it needs to resume fetching the metadata log has been subsumed into a metadata snapshot and is no longer independently fetchable. A soft restart will entail some kind of metadata reset based on the latest available snapshot, plus a catch-up phase to fetch everything after the snapshot end point.

The first case, during startup, occurs before clients are able to connect to the broker. Clients are able to connect to the broker in the second case.
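The coalescing described above can be sketched as follows. This is a simplified illustration, not the actual KIP-500 record schemas or broker code: the record types and fields here are hypothetical stand-ins, and real metadata records carry much more state than an ISR list.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: fold partition state changes from multiple metadata batches into a
// pending view, and apply the coalesced result only once catch-up completes.
public class MetadataCoalescer {
    // Illustrative stand-ins for metadata log records (not the real schemas).
    record PartitionRecord(String topic, int partition, List<Integer> isr) {}
    record IsrChangeRecord(String topic, int partition, List<Integer> newIsr) {}

    private final Map<String, List<Integer>> pendingState = new HashMap<>();

    private static String key(String topic, int partition) {
        return topic + "-" + partition;
    }

    // Records are folded into pendingState; a later record for the same
    // partition overwrites the earlier one, so only the final state survives.
    public void accept(Object recordFromBatch) {
        if (recordFromBatch instanceof PartitionRecord pr) {
            pendingState.put(key(pr.topic(), pr.partition()), pr.isr());
        } else if (recordFromBatch instanceof IsrChangeRecord ic) {
            pendingState.put(key(ic.topic(), ic.partition()), ic.newIsr());
        }
    }

    // Called once catch-up is complete: hand back the coalesced state
    // (in the broker this is where deferred state would finally be applied).
    public Map<String, List<Integer>> applyAll() {
        Map<String, List<Integer>> result = new HashMap<>(pendingState);
        pendingState.clear();
        return result;
    }
}
```

The point of the sketch is that intermediate states (here, the ISR from the `PartitionRecord`) are never applied or served; only the coalesced end state is.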
It is unclear whether clients will be able to connect to the broker during a soft restart (the third case). We need a way to defer the application of topic partition metadata in all of the above cases, and while we are deferring that application the broker will not service clients for the affected partitions.

As a side note, it is arguable whether the broker should be able to service clients while catching up. The decision not to service clients has no impact in the startup case: clients can't connect yet at that point anyway. In the third case it is not yet clear what we are going to do, but being unable to service clients while performing a soft restart seems reasonable. In the second case it is most likely true that we will catch up quickly; it would be unusual to reestablish communication with the metadata quorum, gain a new lease, and begin to catch up only to lose the lease again.

So we need a way to defer the application of partition metadata and make those partitions unavailable while deferring state changes. This PR adds a new internal partition state to `ReplicaManager` to accomplish this. Currently the available partition states are simply `Online`, `Offline` (meaning a log dir failure), and `None` (meaning we don't know about the partition). We add a new `Deferred` state. We also rename a couple of methods that refer to "nonOffline" partitions to instead refer to "online" partitions.

**The new `Deferred` state never occurs when using ZooKeeper for metadata storage.** Partitions can only enter the `Deferred` state when using a KIP-500 Raft metadata quorum and one of the above three cases occurs. The testing strategy is therefore to leverage existing tests to confirm that there is no functionality change in the ZooKeeper case. We will add the logic for deferring/applying/reacting to deferred partition state in separate PRs, since that code will never be invoked in the ZooKeeper world.
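The resulting state space, and how a request path might treat it, could be sketched like this. The real `ReplicaManager` models hosted partitions as a sealed hierarchy in Scala, and the error strings below merely echo Kafka's well-known error codes; everything except the four state names is illustrative:

```java
import java.util.Optional;

// Simplified sketch of the broker-side partition states described above,
// including the new Deferred state added by this PR.
public class PartitionStates {
    enum HostedPartitionState {
        ONLINE,    // partition is live and may serve clients
        OFFLINE,   // log dir failure
        NONE,      // broker does not know about this partition
        DEFERRED   // KIP-500 only: metadata seen but not yet applied
    }

    // A fetch/produce path would reject any partition that is not online;
    // an empty result means the request may proceed.
    static Optional<String> checkAvailable(HostedPartitionState state) {
        switch (state) {
            case ONLINE:
                return Optional.empty();
            case DEFERRED:
                // Unavailable while the broker defers metadata application.
                return Optional.of("NOT_LEADER_OR_FOLLOWER");
            case OFFLINE:
                return Optional.of("KAFKA_STORAGE_ERROR");
            default: // NONE
                return Optional.of("UNKNOWN_TOPIC_OR_PARTITION");
        }
    }
}
```

Under ZooKeeper nothing ever transitions into `DEFERRED`, so this check degenerates to the existing three-state behavior, which is why existing tests suffice to show no functional change there.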
### Committer Checklist (excluded from commit message)

- [ ] Verify design and implementation
- [ ] Verify test coverage and CI build status
- [ ] Verify documentation (including upgrade notes)