Hi, Igor, Thanks for the reply.
11. Yes, your proposal could work. Once the broker receives confirmation of the metadata change, I guess it needs to briefly block appends to the old replica, make sure the future log fully catches up and then make the switch? 13 (b). The kafka-storage.sh is only required in KIP-631 for a brand new KRaft cluster. If the cluster already exists and one just wants to add a log dir, it seems inconvenient to have to run the kafka-storage.sh tool again. Thanks, Jun On Thu, Dec 1, 2022 at 10:19 AM Igor Soarez <soa...@apple.com.invalid> wrote: > > Hi Jun, > > Thank you for reviewing the KIP. Please find my replies to > your comments below. > > 10. Thanks for pointing out this typo; it has been corrected. > > > 11. I agree that the additional delay in switching to the > future replica is undesirable, however I see a couple of > issues if we forward the request to the controller > as you describe: > > a) If the controller persists the change to the log directory > assignment before the future replica has caught up and there > is a failure in the original log directory then if the broker > is a leader for the partition there will be no failover > and the partition will become unavailable. It is not safe to > call AssignReplicasToDirectories before the replica exists > in the designated log directory. > > b) An existing behavior we'd like to maintain if possible is > the ability to move partitions between log directories when a > broker is offline, as it can be very useful to manage storage > space. ZK brokers will load and accept logs in the new > location after startup. > Maintaining this behavior requires that the broker be able to > override/correct assignments that are seen in the cluster metadata. > i.e. the broker is the authority on log directory placement in case > of mismatch. > If we want to keep this feature and have the controller send log > directory reassignments, we'll need a way to distinguish between > mismatch due to offline movement and mismatch due to controller > triggered reassignment. > > To keep the delay low, instead of sending AlterReplicaLogDirs > within the lock the RPC can be queued elsewhere when the future > replica first catches up. ReplicaAlterLogDirsThread can keep > going and not switch the replicas yet. > Once the broker receives confirmation of the metadata change > it can then briefly block appends to the old replica and make the switch. > In the unlikely event that source log directory fails between the moment > AssignReplicasToDirectories is acknowledged by the controller and > before the broker is able to make the switch, then the broker > can issue AssignReplicasToDirectories of that replica back to the offline > log directory to let the controller know that the replica is actually > offline. > What do you think? > > > 12. Indeed, the metadata.log.dir, if explicitly defined to a separate > directory should not be included in the directory UUID list sent > to the controller in broker registration and heartbeat requests. > I have updated the KIP to make this explicit. > > > 13. Thank you for making this suggestion. > Let's address the different scenarios you enumerated: > > a) When enabling JBOD for an existing KRaft cluster > > In this scenario, the broker finds a single log directory configured > in `log.dirs`, with an already existing `meta.properties`, which is > simply missing `directory.id <http://directory.id/>` and `directory.ids`. > > It is safe to have the broker automatically generate the log dir > UUID and update the `meta.properties` file. This removes any need > to have extra steps in upgrading and enabling of this feature, > so it is quite useful. > > > b) Adding a log dir to an existing JBOD KRaft cluster > > In this scenario, the broker finds a shorter list of `directory.ids` > in `meta.properties` than what is configured in `log.dirs`. > > KIP-631 introduced the requirement to run the command kafka-storage.sh > to format storage directories. > > Currently, if the broker in KRaft mode cannot find `meta.properties` > in each log directory it will fail at startup. KIP-785 proposes > removing the need to run the format storage command, but it is > still open. If new log directories are being added the storage > command must be run anyway. So I don't think there will be any > benefit in this case. > > > c) Removing a log dir from an existing JBOD KRaft cluster > > In this scenario the broker finds a larger list of `directory.ids` > in `meta.properties` than what is configured in `log.dirs`. > > The removal of log directories requires an explicit update to > the `log.dirs` configuration, so it is also safe to have the > broker automatically update `directory.ids` in `meta.properties` > to remove the extra UUIDs. It's also useful to drop the requirement > to run the storage command after removal of log directories from > configuration, as it reduces operational burden. > > > c) Upgrading a JBOD ZK cluster to KRaft > > In ZooKeeper brokers write `meta.properties` with `version=0`, > but in KRaft mode brokers require `version=1`. > According to the current discussion on KIP-866, the broker will > automatically upgrade meta.properties from `version=0` to `version=1`. > Assuming the meta.properties gets translated into `version=1` as part > of KIP-866, then the broker should find `meta.properties` which are > simply missing the two new fields: `directory.id <http://directory.id/>` > and `directory.ids`. > So it should be OK for the broker to generate UUIDs for each log > directory and update the `meta.properties` file. > > > I have updated the KIP to remove the need to run the storage command > in these scenarios - apart from b) where it should already be a > requirement AFAICT - and describe how the broker will automatically > update `directory.id <http://directory.id/>` and `directory.ids` in > `meta.properties` if > required. > > > 14. Thank you for pointing this out. Yes, the controller should > simply delegate the log directory choice back to the broker, > by assigning `Uuid.ZERO` to all the replicas that were assigned > to removed log directories. I have updated the KIP under > "Controller" > "Broker registration". > > > 15. Yes. Same as in "Metadata caching", replicas are considered > offline if the replica references a log directory which is not in the > list of online log directories for the broker ID hosting the replica. > This means the controller will not assign leadership those replicas. > This is described under "Controller" > "Handling log directory failures". > > > 16. We'll need to bump metadata.version to gate use of the records > and RPC changes. I've added mention of this under the > "Compatibility, Deprecation, and Migration Plan" section. > > > > Please let me know of any further feedback. > > Best, > > -- > Igor