Hi, Igor,

Thanks for the reply. A few more comments/questions.

30. In ZK mode, LeaderAndIsr has a field isNew that indicates whether a
replica is newly created or not. How do we convey the same thing with
metadata records in KRaft mode?
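
For context, this is roughly how I understand isNew to be used today (a
hypothetical Java sketch with stand-in names, not the actual broker code):

    class BecomeReplicaSketch {
        static void makeFollower(String topicPartition, boolean isNew, boolean logExists) {
            if (!logExists) {
                if (isNew) {
                    createLog(topicPartition);          // freshly assigned: create the log
                } else {
                    markReplicaOffline(topicPartition); // should already exist: its dir likely failed
                }
            }
        }
        static void createLog(String tp) { }
        static void markReplicaOffline(String tp) { }
    }

Without an equivalent signal, it's not clear how a broker distinguishes
these two cases from the metadata alone.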

31. If one replaces the disk backing a log dir, I guess we need to assign
the same UUID to the log dir? I am not sure we could do this, since we
don't store the mapping from log dir name to its UUID.
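
For reference, my understanding is that under this KIP the UUID lives in
each log dir's own meta.properties, something like the following
(illustrative sketch; field names per the KIP, values made up):

    # meta.properties inside one log dir
    version=2
    cluster.id=<cluster-uuid>
    node.id=1
    directory.id=<uuid-of-this-dir>
    directory.ids=<uuid-of-this-dir>,<uuids-of-the-other-dirs>

Replacing the disk loses that file, so presumably the broker would have to
generate a fresh directory.id for it.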

32. "If the partition is not new and does not exist, and there are any
offline log directories, the broker uses AssignReplicasToDirs to change the
metadata assignment to an offline log directory." Hmm, should the broker
send the HeartBeat request to propagate the failed log dir instead?
Actually, does it need to send anything at all, since the same information
should already have been sent when the offline dir was first detected?
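
To check my reading of that rule (hypothetical, self-contained Java; the
helpers are stand-ins, not the actual broker code):

    import java.util.Set;
    import java.util.UUID;

    class StartupAssignmentSketch {
        static void onStartup(String partition, boolean isNew, boolean existsLocally,
                              Set<UUID> offlineDirs) {
            if (!isNew && !existsLocally && !offlineDirs.isEmpty()) {
                // Assume the replica lives in an offline dir and tell the
                // controller, so it can update leadership and ISR.
                UUID offline = offlineDirs.iterator().next();
                sendAssignReplicasToDirs(partition, offline);
            }
        }
        static void sendAssignReplicasToDirs(String partition, UUID dir) { }
    }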

33. "If the partition already exists in another log directory and is a
future replica in the log directory indicated by the metadata, the broker
will replace the current replica with the future replica." It seems that
the broker needs to make sure that the future replica catches up with the
current replica first?
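
In other words, I'd expect a catch-up guard before the swap, along these
lines (hypothetical sketch, stand-in names; today ReplicaAlterLogDirsThread
implements a similar catch-up-then-swap):

    class FuturePromotionSketch {
        static void maybePromote(long currentLogEndOffset, long futureLogEndOffset) {
            if (futureLogEndOffset >= currentLogEndOffset) {
                promoteFutureReplica(); // safe only once the future replica has caught up
            } else {
                keepCopying();          // otherwise keep replicating current -> future
            }
        }
        static void promoteFutureReplica() { }
        static void keepCopying() { }
    }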

34. "As a fallback safety mechanism in case of any synchronisation issues,
when a log directory fails, the broker will check log directory assignments
in the metadata and if there are any missing replicas, the broker will use
AlterReplicaLogDirs assigning the replicas to the offline log directory, to
convey to the controller that these replicas are offline and need
leadership and ISR updates." Hmm, the motivation says that we want to
avoid the large requests introduced in KIP-589. By sending
AlterReplicaLogDirs on log dir failure, it seems that we are issuing the
same large request every time a log dir fails?
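
To make the concern concrete, my reading of the fallback is roughly this
(hypothetical sketch; the stand-in types map partition name to log dir
UUID):

    import java.util.Map;
    import java.util.UUID;

    class DirFailureFallbackSketch {
        static void onLogDirFailure(UUID failedDir,
                                    Map<String, UUID> metadataAssignments,
                                    Map<String, UUID> localAssignments) {
            localAssignments.forEach((partition, localDir) -> {
                // Report replicas whose metadata assignment is missing or stale.
                if (localDir.equals(failedDir)
                        && !failedDir.equals(metadataAssignments.get(partition))) {
                    sendAssignReplicasToDirs(partition, failedDir);
                }
            });
        }
        static void sendAssignReplicasToDirs(String partition, UUID dir) { }
    }

If the mismatching subset can cover every replica in the failed dir, that
is the same order of magnitude as the KIP-589 request.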

35. "When replicas are moved between directories, using the existing
AlterReplicaLogDirs RPC, the receiving broker will start moving the
replicas using AlterReplicaLogDirs threads as usual." AlterReplicaLogDirs
is a new request; the existing one is AlterReplicaDirRequest.

36. "If the broker is configured with multiple log directories it remains
FENCED until it can verify that all partitions are assigned to the correct
log directories in the cluster metadata." Hmm, this could be a bit weird.
A broker probably shouldn't send any request other than HeartBeat to the
controller until it's unfenced. But it may need to send the
AssignReplicasToDirs request to complete the log dir assignment. Is the
broker expected to send AssignReplicasToDirs while still fenced?
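
In other words (hypothetical sketch, stand-in names):

    class UnfenceGateSketch {
        static boolean wantsToBeUnfenced(int mismatchingReplicaAssignments) {
            // Driving this count to zero requires sending AssignReplicasToDirs
            // while still fenced -- hence the question above.
            return mismatchingReplicaAssignments == 0;
        }
    }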

37. "Existing replicas without a log directory are either: Assumed to live
in a broker that isn’t yet configured with multiple log directories, and so
live in a single log directory, even if the UUID for that directory isn’t
yet known by the controller. It is not possible to trigger a log directory
failure from a broker that has a single log directory, as the broker would
simply shut down if there are no remaining online log directories." So, if
there is a single log dir, the broker doesn't send AssignReplicasToDirs
requests to the controller?

38. In the section under "If the partition is assigned to an online log
directory", it seems that we need to add another case: if the partition
exists and a future replica also exists in another dir, should we start
the process to replicate the current replica to the future replica? This
is the existing behavior.
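
Roughly (hypothetical sketch, stand-in names):

    class FutureReplicaStartupSketch {
        static void onStartup(boolean currentExists, boolean futureExistsInOtherDir) {
            if (currentExists && futureExistsInOtherDir) {
                startAlterLogDirsMove(); // existing behavior: resume copying current -> future
            }
        }
        static void startAlterLogDirsMove() { }
    }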

39. "If there are offline log directories, the broker will fail at startup
if there is a mismatch between the number of entries in log.dirs  and
directory.ids." This seems a bit restricting since it means that one can't
add new log dir when an existing log dir is down?
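
i.e. something like this at startup (hypothetical sketch of the quoted
check):

    import java.util.List;
    import java.util.UUID;

    class StartupDirCountCheckSketch {
        static void check(List<String> logDirs, List<UUID> directoryIds, boolean anyOffline) {
            if (anyOffline && logDirs.size() != directoryIds.size()) {
                // With a dir offline, log.dirs cannot gain a new entry
                // until the dir is repaired or removed.
                throw new IllegalStateException(
                    "log.dirs and directory.ids sizes differ while a log dir is offline");
            }
        }
    }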

40. "In a future release, broker registration without online log directory
UUIDs will be disallowed." Hmm, currently, the broker already fails after
log recovery if no log dir is online, right?

Thanks,

Jun


On Mon, Jan 23, 2023 at 3:59 AM Tom Bentley <tbent...@redhat.com> wrote:

> Hi Igor,
>
> Thanks for your replies on points 21-25. Those all LGTM. See below for a
> further thought on the meta.properties upgrade.
>
> Kind regards,
>
> Tom
>
>
> On Fri, 13 Jan 2023 at 18:09, Igor Soarez <soa...@apple.com.invalid>
> wrote:
>
> > Hi Tom,
> >
> > Thank you for having another look.
> >
> > 20. Upon a downgrade to a Kafka version that runs the current
> > "version == 1" assertion, then yes — a downgrade would not be possible
> > without first updating (manually) the meta.properties files back
> > to the previous version.
> >
> > We could prevent this issue if we consider the new fields optional and
> > not bump the version, but this takes away from the value of having
> > a version in the first place. I'm not sure this would be a good
> > trade-off.
> >
> > IIUC in KIP-866 this issue was addressed by delaying the writing of
> > meta.properties under the new version until after the full transition to
> > KRaft mode is finished and relaxing the expectations around versioning.
> > We could take a similar approach by relaxing the current
> > expectation that the version must strictly be 1.
> >
> > Currently, MetaProperties#parse() checks that the version is 1 but only
> > requires that the cluster.id and node.id properties are present in the
> > file.
> > We can relax the requirement, ensuring only that the version is at
> > least 1, and land that change before this KIP is implemented – at
> > least one release beforehand. Then there will be some previous
> > version for which a downgrade
> > is possible, even after the version has been bumped to 2 and the two new
> > properties are introduced. WDYT?
> >
>
> Doing that would delay JBOD support and mean existing users of JBOD must
> upgrade via that specific version.
>
> I think I favour a different way of relaxing the expectation around
> versioning:
> 1. In the new broker, update the meta.properties with the new keys but
> keep version=1 (AFAICS this won't cause problems if we roll back to old
> broker binaries at any point prior to the completion of KRaft upgrade,
> because the code ignores unknown keys).
> 2. When the KRaft upgrade is complete (i.e. once a broker has written the
> final ZkMigrationRecord), update meta.properties to version=2, where the
> new keys are required.
>
> That would mean that during the upgrade the additional validation of
> meta.properties is missing. I think that's acceptable because the extra
> protection provided isn't something supported by ZK-based JBOD today, so
> there is no loss of functionality.
>
> WDYT?
>
>
> > 21. I think in this case we can reuse LOG_DIR_NOT_FOUND (code 57) and
> > perhaps
> > reword its description slightly to be a bit more generic.
> > I've updated the KIP to mention this.
> >
> > 22. This check should be enforced by the broker, as the controller cannot
> > know whether the current replica to logdir assignment is correct.
> > Currently, the controller does not unfence a broker that does not "want"
> > to be unfenced. The broker should not want to be unfenced while there
> > are still mismatching (wrong or missing) replica to logdir assignments.
> > I updated the KIP to clarify this.
> >
> > So a reason wouldn't be included in the heartbeat response but your point
> > about diagnosing is interesting. So I've updated the KIP to propose a
> > new broker metric reporting the number of mismatching replica assignments
> > - i.e. replica assignments that are either missing or incorrect in the
> > cluster metadata, as observed by the broker — until this metric is zero,
> > the broker would not want to be unfenced. Let me know what you think.
> >
> > 23. I don't see why not. We should be able to extend the system tests
> > proposed in KIP-866 to cover this.
> >
> > 24. Correct. Updated to remove the typo.
> >
> > 25. Good point. Updated to align the names.
> >
> >
> > Thank you,
> >
> > --
> > Igor
> >
> >
> >
>
