Hi, Jose,

Thanks for the updated KIP. A few more comments below.

40. LBO: Code wise, logStartOffset is used in so many places like Log,
ReplicaManager, LogCleaner, ReplicaFetcher, checkpoint files, etc. I am not
sure it's worth renaming it in all those places. If the main concern is to
avoid confusion, we can just always spell out logStartOffset.

41. metadata.snapshot.min.cleanable.ratio: Since the snapshot could be
empty initially, it's probably better to define the ratio as new_data_size
/ (new_data_size + snapshot_size). This avoids the divide-by-zero issue
and is also consistent with the cleaner ratio definition for compacted
topics.
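
Just to illustrate what I have in mind (a rough Java sketch with made-up
names, not a proposal for the actual code), the check could look roughly
like this:

    // Sketch of the cleanable-ratio check from comment 41.
    // newDataSize: bytes appended to the replicated log since the last snapshot.
    // snapshotSize: size of the latest snapshot, or 0 if none exists yet.
    static boolean shouldGenerateSnapshot(long newDataSize,
                                          long snapshotSize,
                                          double minCleanableRatio) {
        if (newDataSize == 0) {
            return false; // nothing new since the last snapshot
        }
        // new_data_size / (new_data_size + snapshot_size); the denominator is
        // never zero here, even when no snapshot has been generated yet.
        double ratio = (double) newDataSize / (newDataSize + snapshotSize);
        return ratio >= minCleanableRatio;
    }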

42. FetchSnapshotRequest: Since this is designed to fetch more than one
partition, it seems useful to have a per-partition maxBytes in addition to
the request-level maxBytes, just like in a Fetch request.
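
For example (again just a sketch with hypothetical names, not the real
request handling), the leader could cap each partition's snapshot chunk by
both limits, the way Fetch does today:

    // Sketch: honor both the request-level and the per-partition MaxBytes.
    int remainingBytes = request.maxBytes();
    for (SnapshotPartition partition : request.partitions()) {
        if (remainingBytes <= 0) {
            break;
        }
        int limit = Math.min(partition.maxBytes(), remainingBytes);
        byte[] chunk = readSnapshotChunk(partition.snapshotId(), partition.position(), limit);
        response.addPartition(partition.topicPartition(), chunk);
        remainingBytes -= chunk.length;
    }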

43. FetchSnapshotResponse:
43.1 I think the confusing part about OFFSET_OUT_OF_RANGE is
that FetchSnapshotRequest includes EndOffset. So, OFFSET_OUT_OF_RANGE seems
to suggest that the provided EndOffset is wrong, which is not the intention
of the error code.
43.2 The Position field seems to be the same as the one in
FetchSnapshotRequest. If we have both, should the requester verify the
consistency between the two values, and what should the requester do if the
two values don't match?
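
If we do keep the field in both messages, I'd expect the requester to do
something like the following (hypothetical names, just to make the question
concrete):

    // Sketch: requester cross-checks the echoed Position against what it sent.
    long sentPosition = request.position();
    long returnedPosition = partitionResponse.position();
    if (returnedPosition != sentPosition) {
        // Mismatched chunks can't safely be spliced together, so presumably
        // the requester should drop the partial snapshot and retry from the
        // last position it knows it has written.
        abortPartialSnapshot(partitionResponse.snapshotId());
        retryFetchSnapshot(sentPosition);
    }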

44. metric: Would a metric that captures the lag in offset between the last
snapshot and the logEndOffset be useful?
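
I am thinking of something along these lines (sketch only; Gauge here is a
stand-in, not Kafka's actual metrics API):

    // Sketch: offset lag between the latest snapshot and the log end offset.
    interface Gauge { long value(); } // hypothetical stand-in for a metrics gauge

    Gauge snapshotLag = () -> {
        long logEndOffset = log.logEndOffset();              // hypothetical accessor
        long snapshotEndOffset = latestSnapshotEndOffset();  // 0 if no snapshot yet
        return Math.max(0L, logEndOffset - snapshotEndOffset);
    };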

45. It seems the KIP assumes that every voter (leader and follower) and
observer has a local replicated log for __cluster_metadata. It would be
useful to make that clear in the overview section.

46. Does this KIP cover upgrading from older versions of Kafka? If so, do
we need IBP to guard the usage of modified FetchRequest and the new
FetchSnapshotRequest? If not, could we make it clear that upgrading will be
covered somewhere else?

Thanks,

Jun

On Mon, Sep 28, 2020 at 9:25 PM Jose Garcia Sancio <jsan...@confluent.io>
wrote:

> Hi Guozhang,
>
> Thanks for your feedback. It was very helpful. See my comments below.
>
> Changes to the KIP:
>
> https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=158864763&selectedPageVersions=28&selectedPageVersions=27
>
> On Sun, Sep 27, 2020 at 9:02 PM Guozhang Wang <wangg...@gmail.com> wrote:
> >
> > Hello Jose,
> >
> > Thanks for the KIP. Overall it looks great. I have a few meta / minor
> > questions, or maybe just clarifications, below:
> >
> > Meta:
> >
> > 1. I want to clarify whether only the active controller would generate
> > snapshots, OR any voter in the quorum would generate snapshots, OR even
> > observers would generate snapshots? Originally I thought it was the
> > latter case, but reading through the doc I got confused by some
> > paragraphs. E.g. you mentioned snapshots are generated by the Controller
> > module, and observers would not have that module.
>
> Sorry for the confusion and inconsistency here. Every replica of the
> cluster metadata topic partition will generate a snapshot. That
> includes the voters (leader and followers) and observers. In this KIP
> the leader is the Active Controller, the voters are the Kafka
> Controllers and the observers are the Metadata Cache.
>
> I went through the KIP again and made sure to enumerate both Kafka
> Controllers and Metadata Cache when talking about snapshot generation
> and loading.
>
> I renamed the new configurations to be prefixed by metadata instead of
> controller.
>
> I moved the terminology section to the top.
>
> >
> > 2. Following on Jun's previous comment: currently the __cluster_metadata
> > log is replicated on ALL brokers since all voters and observers would
> > replicate that topic. I know this may be out of the scope of this KIP but
> > I think maybe only letting the voters replicate (and periodically
> > truncate) the log while observers only maintain the in-memory state and
> > snapshots is a good trade-off here, assuming snapshot loading is
> > relatively fast.
>
> This is a good idea and optimization. It would save a write. I think
> we need to think about the implication to KIP-642, the dynamic quorum
> reassignment KIP, if we end up allowing observers to get "promoted" to
> voters.
>
> >
> > 3. When a raft client is in the middle of loading a snapshot, should it
> > reject any vote / begin-/end-/describe-quorum requests at that time? More
> > generally, while a snapshot is being loaded, how should we treat the
> > current state of the client when handling Raft requests?
>
> Re: requesting votes and granting votes.
>
> In the "Changes to Leader Election" section, which I think was improved
> since your review, I mentioned that the raft client needs to look at:
>
> 1. latest/largest snapshot epoch and end offset
> 2. the LEO of the replicated log
>
> The voters should use the latest/largest of these two during the
> election process.
>
> Re: quorum state
>
> For KIP-595 and KIP-630 the snapshot doesn't include any quorum
> information. This may change in KIP-642.
>
> >
> > Minor:
> >
> > 4."All of the live replicas (followers and observers) have replicated
> LBO".
> > Today the raft layer does not yet maintain LBO across all replicas, is
> this
> > information kept in the controller layer? I'm asking because I do not see
> > relevant docs in KIP-631 and hence a bit confused which layer is
> > responsible for bookkeeping the LBOs of all replicas.
>
> This is not minor! :). This should be done in the raft client as part
> of the fetch protocol. Note that LBO is just a rename of log start
> offset. If the current raft implementation doesn't manage this
> information then we will have to implement this as part of
> implementing this KIP (KIP-630).
>
> > 5. "Followers and observers will increase their log begin offset to the
> > value sent on the fetch response as long as the local Kafka Controller
> and
> > Metadata Cache has generated a snapshot with an end offset greater than
> or
> > equal to the new log begin offset." Not sure I follow this: 1) If
> observers
> > do not generate snapshots since they do not have a Controller module on
> > them, then it is possible that observers do not have any snapshots at all
> > if they do not get one from the leader, in that case they would never
> > truncate the logs locally;
>
> Observers will have a Metadata Cache which will be responsible for
> generating snapshots.
>
> > 2) could you clarify on "value sent on the fetch
> > response", are you referring to the "HighWatermark", or "LogStartOffset"
> > in the schema, or some other fields?
>
> The log begin offset is the same as the log start offset. This KIP
> renames that field in the fetch response. I am starting to think that
> renaming this field in this KIP is not worth it. What do you think?
>
> >
> > 6. The term "SnapshotId" is introduced without definition at first. My
> > understanding is it's defined as a combo of <endOffset, epoch>, could you
> > clarify if this is the case?
>
> Good point. I added this sentence to the Snapshot Format section and
> terminology section:
>
> "Each snapshot is uniquely identified by a SnapshotId, the epoch and
> end offset of the records in the replicated log included in the
> snapshot."
>
> > BTW I think the term "endOffset" is a term
> > used per log, and maybe calling that part of the SnapshotId "nextOffset"
> > is better since that offset is likely already filled with a record.
>
> I think using "end offset" is consistent with Kafka terminology. For
> example, in KIP-595 we use (epoch, end offset) to describe the
> DivergingEpoch.
>
> > 7. This is a very nit one: "If the latest snapshot has an epoch E and end
> > offset O and it is newer than the LEO of the replicated log, then the
> > replica must set the LBO and LEO to O." On wiki `O` and `0` look very much
> > the same and that confused me a couple of times... I'd suggest we phrase
> > any such occasions as "an epoch e1 and offset o1". Also for LEO, since
> > we would not really know what would be its epoch (since it may be bumped),
> > when comparing we only care about the offset and not about the epoch,
> > right? If yes, please clarify that in the doc as well.
>
> Agreed. Thanks for the suggestion. Done.
>
> > 8. "LEO - log end offset - the largest offset and epoch that has been
> > written to disk." I think LEO is the "next" offset to be written to the
> log
> > right? Also it seems consistent with your diagrams.
>
> Yes. Fixed the diagram and description. I double checked that the rest
> of the KIP refers to LEO correctly.
>
> > 9. "... will send a vote request and response as if they had an empty
> log."
> > Not sure I completely follow this, do you mean that they will set
> > "LastOffsetEpoch/LastOffset" as "-1/0" when sending a vote request, and
> > upon receiving a vote request it would compare the
> > request's "LastOffsetEpoch/LastOffset" with "-1/0" as well?
>
> That paragraph contradicted the section "Changes to Leader Election".
> I made improvements to the "Changes to Leader Election" section but
> forgot to update this paragraph. I have simplified this paragraph to
> just refer to that section.
>
> >
> > 10. In the FetchSnapshot response schema, just to clarify the "Position":
> > "The byte position within the snapshot." is referring to the starting byte
> > position of the returned snapshot data, right?
>
> Yes. In Java land this will be the position argument used in
> FileChannel.write(ByteBuffer, long).
>
> https://docs.oracle.com/en/java/javase/15/docs/api/java.base/java/nio/channels/FileChannel.html#write(java.nio.ByteBuffer,long)
>
> I improved the description for that "Position" field in the JSON
> schema for the FetchSnapshot request and response.
>
