Re: [DISCUSS] KIP-858: Handle JBOD broker disk failure in KRaft

Jun Rao Wed, 21 Dec 2022 16:25:39 -0800

Hi, Igor,

Thanks for the reply.


11. Yes, your proposal could work. Once the broker receives confirmation of
the metadata change, I guess it needs to briefly block appends to the old
replica, make sure the future log fully catches up and then make the switch?

13 (b). The kafka-storage.sh is only required in KIP-631 for a brand new
KRaft cluster. If the cluster already exists and one just wants to add a
log dir, it seems inconvenient to have to run the kafka-storage.sh tool
again.

Thanks,

Jun


On Thu, Dec 1, 2022 at 10:19 AM Igor Soarez <soa...@apple.com.invalid>
wrote:

>
> Hi Jun,
>
> Thank you for reviewing the KIP. Please find my replies to
> your comments below.
>
> 10. Thanks for pointing out this typo; it has been corrected.
>
>
> 11. I agree that the additional delay in switching to the
> future replica is undesirable, however I see a couple of
> issues if we forward the request to the controller
> as you describe:
>
>    a) If the controller persists the change to the log directory
>    assignment before the future replica has caught up and there
>    is a failure in the original log directory then if the broker
>    is a leader for the partition there will be no failover
>    and the partition will become unavailable. It is not safe to
>    call AssignReplicasToDirectories before the replica exists
>    in the designated log directory.
>
>    b) An existing behavior we'd like to maintain if possible is
>    the ability to move partitions between log directories when a
>    broker is offline, as it can be very useful to manage storage
>    space. ZK brokers will load and accept logs in the new
>    location after startup.
>    Maintaining this behavior requires that the broker be able to
>    override/correct assignments that are seen in the cluster metadata.
>    i.e. the broker is the authority on log directory placement in case
>    of mismatch.
>    If we want to keep this feature and have the controller send log
>    directory reassignments, we'll need a way to distinguish between
>    mismatch due to offline movement and mismatch due to controller
>    triggered reassignment.
>
> To keep the delay low, instead of sending AlterReplicaLogDirs
> within the lock the RPC can be queued elsewhere when the future
> replica first catches up. ReplicaAlterLogDirsThread can keep
> going and not switch the replicas yet.
> Once the broker receives confirmation of the metadata change
> it can then briefly block appends to the old replica and make the switch.
> In the unlikely event that source log directory fails between the moment
> AssignReplicasToDirectories is acknowledged by the controller and
> before the broker is able to make the switch, then the broker
> can issue AssignReplicasToDirectories of that replica back to the offline
> log directory to let the controller know that the replica is actually
> offline.
> What do you think?
>
>
> 12. Indeed, the metadata.log.dir, if explicitly defined to a separate
> directory should not be included in the directory UUID list sent
> to the controller in broker registration and heartbeat requests.
> I have updated the KIP to make this explicit.
>
>
> 13. Thank you for making this suggestion.
> Let's address the different scenarios you enumerated:
>
>  a) When enabling JBOD for an existing KRaft cluster
>
>  In this scenario, the broker finds a single log directory configured
>  in `log.dirs`, with an already existing `meta.properties`, which is
>  simply missing `directory.id <http://directory.id/>` and `directory.ids`.
>
>  It is safe to have the broker automatically generate the log dir
>  UUID and update the `meta.properties` file. This removes any need
>  to have extra steps in upgrading and enabling of this feature,
>  so it is quite useful.
>
>
>  b) Adding a log dir to an existing JBOD KRaft cluster
>
>  In this scenario, the broker finds a shorter list of `directory.ids`
>  in `meta.properties` than what is configured in `log.dirs`.
>
>  KIP-631 introduced the requirement to run the command kafka-storage.sh
>  to format storage directories.
>
>  Currently, if the broker in KRaft mode cannot find `meta.properties`
>  in each log directory it will fail at startup. KIP-785 proposes
>  removing the need to run the format storage command, but it is
>  still open. If new log directories are being added the storage
>  command must be run anyway. So I don't think there will be any
>  benefit in this case.
>
>
>  c) Removing a log dir from an existing JBOD KRaft cluster
>
>  In this scenario the broker finds a larger list of `directory.ids`
>  in `meta.properties` than what is configured in `log.dirs`.
>
>  The removal of log directories requires an explicit update to
>  the `log.dirs` configuration, so it is also safe to have the
>  broker automatically update `directory.ids` in `meta.properties`
>  to remove the extra UUIDs. It's also useful to drop the requirement
>  to run the storage command after removal of log directories from
>  configuration, as it reduces operational burden.
>
>
>  c) Upgrading a JBOD ZK cluster to KRaft
>
>  In ZooKeeper brokers write `meta.properties` with `version=0`,
>  but in KRaft mode brokers require `version=1`.
>  According to the current discussion on KIP-866, the broker will
>  automatically upgrade meta.properties from `version=0` to `version=1`.
>  Assuming the meta.properties gets translated into `version=1` as part
>  of KIP-866, then the broker should find `meta.properties` which are
>  simply missing the two new fields: `directory.id <http://directory.id/>`
> and `directory.ids`.
>  So it should be OK for the broker to generate UUIDs for each log
>  directory and update the `meta.properties` file.
>
>
> I have updated the KIP to remove the need to run the storage command
> in these scenarios - apart from b) where it should already be a
> requirement AFAICT - and describe how the broker will automatically
> update `directory.id <http://directory.id/>` and `directory.ids` in
> `meta.properties` if
> required.
>
>
> 14. Thank you for pointing this out. Yes, the controller should
> simply delegate the log directory choice back to the broker,
> by assigning `Uuid.ZERO` to all the replicas that were assigned
> to removed log directories. I have updated the KIP under
> "Controller" > "Broker registration".
>
>
> 15. Yes. Same as in "Metadata caching", replicas are considered
> offline if the replica references a log directory which is not in the
> list of online log directories for the broker ID hosting the replica.
> This means the controller will not assign leadership those replicas.
> This is described under "Controller" > "Handling log directory failures".
>
>
> 16. We'll need to bump metadata.version to gate use of the records
> and RPC changes. I've added mention of this under the
> "Compatibility, Deprecation, and Migration Plan" section.
>
>
>
> Please let me know of any further feedback.
>
> Best,
>
> --
> Igor

Re: [DISCUSS] KIP-858: Handle JBOD broker disk failure in KRaft

Reply via email to