rondagostino opened a new pull request #10003:
URL: https://github.com/apache/kafka/pull/10003


   Brokers receive metadata from the Raft metadata quorum very differently than they do from ZooKeeper today, and this has implications for ReplicaManager.  In particular, when a broker reads the metadata log it may not arrive at the ultimate state for a partition until it reads multiple messages.  In normal operation the multiple messages associated with a state change will all appear in a single batch, so they can and will be coalesced and applied together.  However, there are circumstances where the messages associated with a partition state change will span multiple batches, and then we will be forced to coalesce those batches together.  This occurs in the following circumstances:
   
   - When the broker restarts it must "catch up" on the metadata log, and it is 
likely that the broker will see multiple partition state changes for a single 
partition across different batches while it is catching up.  For example, it 
will see the `TopicRecord` and the `PartitionRecords` for the topic creation, 
and then it will see any `IsrChangeRecords` that may have been recorded since 
the creation.  The broker does not know the state of the topic partitions until 
it reads and coalesces all the messages.
   - The broker will have to "catch up" on the metadata log if it becomes 
fenced and then regains its lease and resumes communication with the metadata 
quorum.
   - A fenced broker may ultimately have to perform a "soft restart" if it was 
fenced for so long that the point at which it needs to resume fetching the 
metadata log has been subsumed into a metadata snapshot and is no longer 
independently fetchable.  A soft restart will entail some kind of metadata 
reset based on the latest available snapshot plus a catchup phase to fetch 
after the snapshot end point.
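   To make the catch-up coalescing concrete, here is a minimal, hypothetical sketch (the record types, field shapes, and method names below are invented for illustration and are not the actual Kafka metadata schemas): the ultimate state of each partition is the fold of every record that mentions it, regardless of where batch boundaries fall.

   ```java
   import java.util.*;

   public class CoalesceExample {
       // Hypothetical stand-ins for metadata log records; not the real Kafka schemas.
       record PartitionRecord(String topic, int partition, List<Integer> isr) {}
       record IsrChangeRecord(String topic, int partition, List<Integer> newIsr) {}

       // The ultimate state for each partition is obtained only after folding
       // every record that mentions it, which may span multiple batches.
       static Map<String, List<Integer>> coalesce(List<Object> records) {
           Map<String, List<Integer>> isrByPartition = new HashMap<>();
           for (Object r : records) {
               if (r instanceof PartitionRecord p) {
                   isrByPartition.put(p.topic() + "-" + p.partition(), p.isr());
               } else if (r instanceof IsrChangeRecord c) {
                   // A later ISR change replaces the state recorded at topic creation.
                   isrByPartition.put(c.topic() + "-" + c.partition(), c.newIsr());
               }
           }
           return isrByPartition;
       }

       public static void main(String[] args) {
           // Batch 1: topic creation; batch 2: a subsequent ISR shrink.
           List<Object> log = List.of(
               new PartitionRecord("foo", 0, List.of(1, 2, 3)),
               new IsrChangeRecord("foo", 0, List.of(1, 2)));
           System.out.println(coalesce(log)); // {foo-0=[1, 2]}
       }
   }
   ```

   Until the fold has consumed all relevant records, any intermediate state (e.g. the ISR from the creation batch alone) may be stale, which is why applying records one batch at a time is not safe during catch-up.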
   
   The first case -- during startup -- occurs before clients are able to connect to the broker.  Clients are able to connect to the broker in the second case.  It is unclear whether clients will be able to connect to the broker during a soft restart (the third case).
   
   We need a way to defer the application of topic partition metadata in all of the above cases, and while we are deferring that application the broker will not service clients for the affected partitions.
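   A rough sketch of that deferral behavior, with hypothetical class and method names (the real ReplicaManager API differs, and the actual deferring/applying logic lands in later PRs as noted below): while catching up, state changes are buffered rather than applied, and a partition with buffered changes is not served.

   ```java
   import java.util.*;

   // Hypothetical sketch of deferring partition state changes during catch-up.
   public class DeferralSketch {
       private final Map<String, List<String>> deferredChanges = new HashMap<>();
       private final Map<String, String> appliedState = new HashMap<>();
       private boolean catchingUp = true;

       void onStateChange(String partition, String change) {
           if (catchingUp) {
               // Defer: remember the change, but do not apply it yet.
               deferredChanges.computeIfAbsent(partition, k -> new ArrayList<>()).add(change);
           } else {
               appliedState.put(partition, change);
           }
       }

       boolean canServe(String partition) {
           // A partition with deferred (unapplied) changes is unavailable to clients.
           return !deferredChanges.containsKey(partition) && appliedState.containsKey(partition);
       }

       void finishCatchUp() {
           // Apply all coalesced deferred changes at once, keeping only the
           // last (ultimate) state for each partition.
           deferredChanges.forEach((p, changes) ->
               appliedState.put(p, changes.get(changes.size() - 1)));
           deferredChanges.clear();
           catchingUp = false;
       }
   }
   ```

   The key property is that clients never observe an intermediate state: a partition becomes servable only after its full set of deferred changes has been applied together.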
   
   As a side note, it is arguable whether the broker should be able to service clients while catching up.  The decision not to service clients has no impact in the startup case -- clients can't connect yet at that point anyway.  In the third case it is not yet clear what we are going to do, but being unable to service clients while performing a soft restart seems reasonable.  In the second case we will most likely catch up quickly; it would be unusual to reestablish communication with the metadata quorum, gain a new lease, and begin to catch up only to lose the lease again.
   
   So we need a way to defer the application of partition metadata and make those partitions unavailable while deferring state changes.  This PR adds a new internal partition state to ReplicaManager to accomplish this.  Currently the available partition states are simply `Online`, `Offline` (meaning a log dir failure), and `None` (meaning we don't know about the partition).  We add a new `Deferred` state.  We also rename a couple of methods that refer to "nonOffline" partitions to instead refer to "online" partitions.
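   The resulting state space can be sketched roughly as follows (the actual ReplicaManager code is Scala and its states carry per-state payloads; this Java enum with hypothetical names only illustrates the four states and the "online" predicate the renamed methods filter on):

   ```java
   // Rough illustrative sketch of the internal partition states after this change.
   public enum PartitionStateSketch {
       NONE,      // the broker does not know about the partition
       ONLINE,    // the partition is hosted and served normally
       OFFLINE,   // the partition's log dir has failed
       DEFERRED;  // new: metadata changes are buffered, not yet applied (Raft/KIP-500 only)

       // Only online partitions are served to clients; the methods renamed from
       // "nonOffline" to "online" would filter on a predicate like this one.
       boolean isOnline() {
           return this == ONLINE;
       }
   }
   ```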
   
   **The new `Deferred` state never happens when using ZooKeeper for metadata 
storage.** Partitions can only enter the `Deferred` state when using a KIP-500 
Raft metadata quorum and one of the above 3 cases occurs.  The testing strategy 
is therefore to leverage existing tests to confirm that there is no 
functionality change in the ZooKeeper case.  We will add the logic for 
deferring/applying/reacting to deferred partition state in separate PRs since 
that code will never be invoked in the ZooKeeper world.
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   

