Re: [DISCUSS] KIP-858: Handle JBOD broker disk failure in KRaft

Igor Soarez Wed, 02 Nov 2022 10:31:54 -0700

Hi Tom,

Thank you for having a look.

0. Thanks for pointing this out. I think it's worth having a new sub-command
in `kafka-storage.sh` — `update-directories` — more suitable for situations
where `metadata.properties` already exists. I've updated the section with
further detail.

When upgrading from a non-JBOD KRaft cluster all the configured log
directories will already be "formatted". We could require that
`update-directories` is run as part of the migration, but I think it seems
easier to have the broker simply update the two properties `directory.id`
and `directory.ids` in `metadata.properties` as necessary on startup.
I don't have a particularly strong opinion either way, it just seems
to create less friction for operators during the upgrade at very little cost.
This is described under "Public Interfaces" > `meta.properties`.
What do you think?

1. That makes sense. I've tried to make the example clearer by adding
further detail as you suggest. Please let me know if you see further room for
improvement.

2. Indeed. I've updated the section to reflect this.

3. The motivation was to avoid prolonging the time window between a
partition being assigned to a broker in the cluster metadata and the broker
actually being able to take leadership for it. But revisiting this question
now, it feels it's a better choice to keep things simple and leave
this possible optimisation for future work.

I've updated the KIP to reflect that the controller does not assign
partitions to log directories, instead using Uuid.ZERO and letting the
broker communicate back the chosen log directory via the
ASSIGN_REPLICAS_TO_DIRECTORIES RPC.

4. This perhaps another unnecessary optimisation. The intent was to
allow use-cases that don't use JBOD to bypass this initialisation cost.
We can keep it simple and remove this unnecessary optimisation.
I've updated the KIP accordingly.

5. This is a good idea. I've added a new "Future work" section with
the pieces we know are missing.

Regarding using the BrokerHeartbeat RPC: I think if we avoid sending the
full list of log directories in every heartbeat request as you describe
makes this option more attractive. I've gotten rid of the
LOG_DIRECTORIES_OFFLINE RPC and added details as to how the BrokerHeartbeat
RPC request will include a new `LogDirsOfflined` field.

I still don't think it makes sense to add a `LogDirsOnlined` field,
as it is not possible for a log directory to come back online without
a broker restart — which would result in a new `BrokerRegistration` request.

Thanks,

--
Igor

> On 1 Nov 2022, at 17:59, Tom Bentley <tbent...@redhat.com> wrote:
> 
> Hi Igor,
> 
> Thanks for the KIP, I've finally managed to take an initial look.
> 
> 0. You mention the command line tools (which one?) in the public interfaces
> section, but don't spell out how it changes -- what options are added.
> Reading the proposed changes it suggests that there are no changes to the
> supported options and that it is done automatically during initial
> formatting based on the broker config. But what about the case where we're
> upgrading an existing non-JBOD KRaft cluster where the meta.properties
> already exists? Do we just run `./bin/kafka-storage.sh format -c
> /tmp/server.properties` again? How would an operator remove an existing log
> dir?
> 
> 1. In the example for the storage format command I think it's worth
> explaining it in a little more detail. i.e. that the `directory.ids` has
> three items: two for the configured log.dirs and one for the configured
> metadata.log.dir.
> 
> 2. In 'Broker lifecycle management' you presumably want to check that the
> directory ids for each log dir are actually unique.
> 
> 3. I don't understand the motivation for having the controller decide the
> log dir for new replicas. I think there are two cases we want to support:
> a) Where the user specifies the log dir (likely based on some external
> information). This is out of scope for this KIP.
> b) If the user didn't specify, isn't the broker in a better position to
> decide (for example, based on free storage), and the communicate back to
> the controller the log dir that was chosen using the
> ASSIGN_REPLICAS_TO_DIRECTORIES RPC?
> 
> 4. Broker registration. I don't understand the intent behind the
> optimization for the single log dir case (last bullet). "Brokers whose
> registration indicate that multiple log directories are configured remain
> FENCED until all log directory assignments for that broker are learnt by
> the active controller and persisted into metadata." is something you've
> committed to anyway.
> 
> 5. AFAICS there's no way for the user to determine via the Kafka protocol
> which directory id corresponds to which log dir path. I.e. you've not
> changed any of the Admin APIs. Perhaps adding a Future Work section to
> spell out the pieces we know are missing would be a good idea?
> 
> I would second Jason's idea for piggybacking on- and off-line state changes
> on the BrokerHeartbeat RPC. I suspect the complexity of making this
> incrementally isn't so great, given that both broker and controller need to
> keep track of the on- and off-line directories anyway. i.e. We could add
> LogDirsOfflined and LogDirsOnlined fields to both request and response and
> have the broker keep including a log dir in requests until acknowledged in
> the response, but otherwise they'd be empty.
> 
> Thanks again,
> 
> Tom
> 
> On Tue, 25 Oct 2022 at 19:59, Igor Soarez <soa...@apple.com.invalid> wrote:
> 
>> Hello,
>> 
>> There’s now a proposal to address ZK to KRaft migration — KIP-866 — but
>> currently it doesn't address JBOD so I've decided to update this proposal
>> to address that migration scenario.
>> 
>> So given that:
>> 
>> - When migrating from a ZK cluster running JBOD to KRaft, brokers
>> registering in KRaft mode will need to be able to register all configured
>> log directories.
>> - As part of the migration, the mapping of partition to log directory will
>> have to be learnt by the active controller and persisted into the cluster
>> metadata.
>> - It isn’t safe to allow for leadership from replicas without this
>> mapping, as if the hosting log directory fails there will be no failover
>> mechanism.
>> 
>> I have updated the proposal to reflect that:
>> 
>> - Multiple log directories may be indicated in the first broker
>> registration referencing log directory UUIDs. We no longer require a single
>> log directory to start with.
>> - The controller must never assign leadership to a replica in a broker
>> registered with multiple log directories, unless the partition to log
>> directory assignment is already in the cluster metadata.
>> - The broker should not be unfenced until all of its partition to log
>> directory mapping is persisted into cluster metadata
>> 
>> I've also added details as to how the ZK to KRaft migration can work in a
>> cluster that is already operating with JBOD.
>> 
>> Please have a look and share your thoughts.
>> 
>> --
>> Igor
>> 
>> 
>>

Re: [DISCUSS] KIP-858: Handle JBOD broker disk failure in KRaft

Reply via email to