Re: [DISCUSS] KIP-858: Handle JBOD broker disk failure in KRaft

Igor Soarez Fri, 03 Feb 2023 11:07:00 -0800

Hi Jun,

Thank you for your comments and questions.


30. Thank you for pointing this out. The isNew flag is not available
in KRaft mode. The broker can consider the metadata records:
If, and only if, the logdir assigned is Uuid.ZERO then the replica can
be considered new.

Being able to determine if a replica "isNew" is important. When some
logdirs become offline, it prevents the remaining logdirs from filling up
with re-created replicas that already exist in the offline logdirs.
So the broker will refuse to create logs that are not new if there are
any offline logdirs.

If a logdir is removed from configuration, the controller will detect
this change upon broker registration and reset all partitions assigned
to the removed logdirs to Uuid.ZERO. In this case, it is OK for the
broker to assume that the partitions are new because they do not exist in
any _configured_ online or offline logdir, and the intended behavior is
to re-create them in one of the online logdirs anyway.

I have updated the KIP to make it clear broker decisions are based
on the metadata, and not on this flag.
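
To illustrate the rule in point 30, here is a minimal sketch. It is not
Kafka code: the class and method names are hypothetical, and java.util.UUID
stands in for Kafka's Uuid type, with an all-zero value playing the role
of Uuid.ZERO.

```java
import java.util.UUID;

/** Illustrative only; names are hypothetical and not part of Kafka. */
final class ReplicaCreationSketch {
    /** Stand-in for Uuid.ZERO, the "no logdir assigned" marker. */
    static final UUID ZERO = new UUID(0L, 0L);

    /** A replica is considered new if, and only if, metadata assigns it to ZERO. */
    static boolean isNew(UUID assignedDir) {
        return ZERO.equals(assignedDir);
    }

    /** The broker refuses to create a log that is not new while any logdir is offline. */
    static boolean mustRefuseCreation(UUID assignedDir, boolean anyLogDirOffline) {
        return !isNew(assignedDir) && anyLogDirOffline;
    }
}
```

The decision takes only the metadata assignment and the logdir state as
input, which is the point of dropping the isNew flag.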


31. I don't think I understand the question.
Why do we need to assign the same UUID?

A logdir may be replaced with a new disk by replacing its configured path
with the new disk mount path under the `log.dirs` property.
While the broker was offline, the operator might have copied the contents
of the old logdir to the new disk, or not.

If contents were copied over, then so was the logdir's meta.properties,
along with the UUID, in which case no change is necessary. The broker will
load all configured logdir paths, all existing meta.properties, and verify
that the full set of UUIDs is still congruent across all meta.properties
files. Neither broker nor controller will know that something has changed,
and neither of them needs to. All partition assignments are still correct.
The mapping of UUID to logdir is determined by the meta.properties
under that same logdir.

If the contents were not copied then this is assumed to be a new
and empty logdir. It should get a different UUID. When the broker loads
all meta.properties it will verify that one is missing for the new disk and
create it, generating a new UUID. It will also update the full set of UUIDs
listed in any other meta.properties files. On the broker registration
request the controller will notice a new UUID being registered, but also
notice a UUID missing.
Any topic partitions assigned to the now missing logdir UUID will be
updated to relate to Uuid.ZERO, so that the broker can place them in the
most suitable logdir - which is likely to be the new and empty one.
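
The controller-side reset described above could look roughly like the
following sketch. Everything here is an assumption made for illustration:
the class name, the use of plain strings as partition keys, and
java.util.UUID standing in for Kafka's Uuid (with an all-zero value in
place of Uuid.ZERO).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

/** Illustrative only; names are hypothetical and not part of Kafka. */
final class DirAssignmentResetSketch {
    /** Stand-in for Uuid.ZERO. */
    static final UUID ZERO = new UUID(0L, 0L);

    /**
     * Returns a copy of the assignments in which every partition assigned to
     * a logdir UUID that is absent from the broker registration is reset to ZERO.
     *
     * @param assignments    partition key to logdir UUID, per cluster metadata
     * @param registeredDirs all logdir UUIDs (online and offline) in the registration
     */
    static Map<String, UUID> resetMissing(Map<String, UUID> assignments,
                                          Set<UUID> registeredDirs) {
        Map<String, UUID> updated = new HashMap<>();
        for (Map.Entry<String, UUID> entry : assignments.entrySet()) {
            UUID dir = entry.getValue();
            boolean stillValid = ZERO.equals(dir) || registeredDirs.contains(dir);
            updated.put(entry.getKey(), stillValid ? dir : ZERO);
        }
        return updated;
    }
}
```

Partitions reset to ZERO are then treated as new by the broker, which is
free to place them in whichever logdir is most suitable.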


32. You are correct, the HeartBeat request should convey the failure
and the broker shouldn't need to send an AssignReplicasToDirs request.

The bit preceding that quote is important:
  "If the partition is assigned to an online log directory"
In this case the broker finds that the metadata assigns a non-new
replica to an online logdir, but this replica
cannot actually be found in any online logdir.
So we want to tell the controller that the metadata is wrong, and that
the replica is actually offline.

This is a defensive design option.

In a scenario where for some reason the broker can see that the metadata
is incorrect about the logdir assignment of a replica that existed in the
failed logdir, it is better to correct and recover than to allow the
problem to persist.

Ignoring the error could mean that the partition stays offline. If the
controller is only told about the UUID of the failed logdir, it won't
be able to determine that a leadership and ISR update is required for
any replica with an incorrect logdir assignment.

An alternative – when facing this unlikely failure scenario – would be
for the broker to error and exit, which would be more disruptive.


33. Correct. I should've made that clear. Updated.


34. No. It shouldn't be a large request, and it should only happen rarely.
This relates to point 32.
When a logdir fails, that failure is communicated to the controller by
indicating the logdir UUID in the heartbeat request. The controller
can determine that _the partitions assigned to that logdir UUID_
are now offline. But if there are any partitions that were in that logdir
and do not have that same logdir UUID assigned to them in the cluster metadata,
then the broker needs to signal that these are also offline, as the
controller will not be able to determine that without the assignment.

We expect each broker to proactively instruct the controller to keep the
metadata correct about the logdir assignment for each replica, so
situations where the metadata is wrong should be rare, and when they
happen only a small number of replicas should be affected. Hence this
should be both a small and rare request.
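
As a rough sketch of which replicas would need that extra signal: the
broker compares what it knows was in the failed logdir against the
assignment in the metadata. The names below are hypothetical, partition
keys are plain strings, and java.util.UUID stands in for Kafka's Uuid.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.UUID;

/** Illustrative only; names are hypothetical and not part of Kafka. */
final class FailedDirReportSketch {
    /**
     * Finds replicas that lived in the failed logdir but that the metadata
     * assigns elsewhere (or not at all). The controller cannot infer these
     * are offline from the failed logdir UUID alone.
     *
     * @param failedDir           UUID of the logdir that failed
     * @param replicasInFailedDir partitions the broker hosted in that logdir
     * @param metadataAssignment  partition key to logdir UUID, per cluster metadata
     */
    static Set<String> mismatched(UUID failedDir,
                                  Set<String> replicasInFailedDir,
                                  Map<String, UUID> metadataAssignment) {
        Set<String> result = new TreeSet<>();
        for (String partition : replicasInFailedDir) {
            if (!failedDir.equals(metadataAssignment.get(partition))) {
                result.add(partition);
            }
        }
        return result;
    }
}
```

If the metadata has been kept correct, the returned set is empty and the
heartbeat carrying the failed logdir UUID is all the controller needs.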


35. Hmm, I could not find the string "AlterReplicaDirRequest"
in the source:
  https://github.com/apache/kafka/search?q=AlterReplicaDirRequest

I'm referring to this API key:
  clients/src/main/java/org/apache/kafka/common/protocol/ApiKeys.java#L78


36. The risk is that if the broker is unfenced while the controller still
has an incorrect view of the logdir assignment it may assign leadership
to the broker for some partition which is incorrectly assigned in metadata.
If that happens, when a logdir fails, the heartbeat request
indicating the failed logdir UUID will not cause the controller to
take action and reassign leadership, and we may end up with an unavailable
partition.

The controller will assume that partition leadership is being performed
correctly and will not take any action, as long as it thinks the broker
is alive, and that the partition is assigned to an online logdir.
It could be interesting to find a more general solution to this issue,
as that would eliminate a wider range of failures in Kafka. But I don't
currently have any suggestions there.

The requests sent while still fenced aim to correct the logdir assignment
for all of the partitions in the broker. One of the reasons that the
assignment may be incorrect is that an operator might have relocated some
partitions to a different logdir while the broker was offline.
This is a currently supported feature - albeit probably not widely known.

Why is it important that there should be no other requests while the
broker is still fenced?


37. I had originally proposed that if there is a single logdir configured
the controller could assume that all the existing replicas are assigned
to the only logdir indicated in broker registration request, provided
there isn't a previous registration that indicates any logdir UUIDs.
This would avoid the broker sending AssignReplicasToDirs to populate
the initial assignments.

If the broker is registering with a single logdir, but the previous broker
registration indicates some logdir UUID then the controller cannot make this
simplification, as the logdir could be a new one, or a previous second
logdir might have been removed from configuration and the current assignment
is unclear.

We could maybe say that whenever there is a single logdir, the broker
will not bother about the assignment in general. The downside of this is that
there might be more work to do later (more partition assignments to correct)
when a second logdir is configured. I think this may be more disruptive.
It is preferable to spread out the effort to maintain a correct
assignment in the metadata.

Tom Bentley raised this in point 4, and since it is not a strictly
necessary optimisation I updated the KIP to remove it back then.
Do you think we should keep the optimisation?


38. Correct. I've updated the KIP.


39. I think I forgot to update this after I changed the proposal to say
the meta.properties are automatically updated. I have updated this
section to clarify that the broker will automatically update the
file if possible.

A new logdir can be added while there are other, offline logdirs, as
long as the set of UUIDs in `directory.ids` is expanded to include the
new one. So the number of UUIDs in `directory.ids` and of paths in
`log.dirs` should always match.

It is important that the broker be able to distinguish between UUIDs
for logdirs that are offline, vs UUIDs for logdirs that were removed
from configuration.

Suppose the broker starts up configured with two logdirs, where each
logdir contains a meta.properties file indicating three different UUIDs
under `directory.ids`, but only one of the configured logdirs is accessible
(online). Then it is not possible for the broker to automatically update
the file, as it won't be able to distinguish between the UUID for the
offline logdir and the UUID for the removed logdir. In this case the broker
should fail to start. The operator can either bring the offline logdir back
up, restore the log.dirs configuration, or manually update meta.properties.
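
A sketch of that startup decision follows, under assumptions of my own:
the names are hypothetical, java.util.UUID stands in for Kafka's Uuid, and
the case where fewer UUIDs are unaccounted for than there are offline
logdirs is treated conservatively as a failure, which is not something
stated above.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

/** Illustrative only; names are hypothetical and not part of Kafka. */
final class MetaPropertiesCheckSketch {
    enum StartupAction { NO_CHANGE, AUTO_UPDATE, FAIL_TO_START }

    /**
     * @param listedIds          UUIDs found under directory.ids in readable meta.properties
     * @param onlineIds          UUIDs of the configured logdirs that are accessible
     * @param offlineConfigured  number of configured logdirs that are not accessible
     */
    static StartupAction decide(Set<UUID> listedIds, Set<UUID> onlineIds,
                                int offlineConfigured) {
        // UUIDs that are listed but do not belong to any online logdir.
        Set<UUID> unaccounted = new HashSet<>(listedIds);
        unaccounted.removeAll(onlineIds);

        // An online logdir whose UUID is not listed yet is a newly added one.
        boolean hasNewDirs = !listedIds.containsAll(onlineIds);

        if (unaccounted.size() == offlineConfigured) {
            // Every unaccounted UUID maps to an offline logdir; nothing was removed.
            return hasNewDirs ? StartupAction.AUTO_UPDATE : StartupAction.NO_CHANGE;
        }
        if (offlineConfigured == 0) {
            // All logdirs are online, so unaccounted UUIDs are removed logdirs.
            return StartupAction.AUTO_UPDATE;
        }
        // Cannot tell offline logdir UUIDs apart from removed ones.
        return StartupAction.FAIL_TO_START;
    }
}
```

With three listed UUIDs, one online logdir and one offline configured
logdir, two UUIDs are unaccounted for and only one can belong to the
offline logdir, so the sketch returns FAIL_TO_START as in the example above.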


40. Indeed. What I meant to say here is that the controller should not
accept broker registration requests that do not indicate any online
logdir UUIDs. We don't expect the broker would send these anyway.

During the upgrade from non-JBOD we could allow brokers to register
without specifying any logdir UUID (online or offline). But thinking
about this again now, I don't think it will be necessary, as this idea
was from before the metadata.version feature flag change was introduced.
BrokerRegistrationRequest should only include logdir UUIDs after
all servers are upgraded, and by then all logdirs will have a UUID assigned.

I've updated the KIP to clarify that BrokerRegistrationRequest must always
include some online logdir UUID.


Best,

--
Igor
