Hi Guozhang,

Thanks for your feedback. It was very helpful. See my comments below.

Changes to the KIP:
https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=158864763&selectedPageVersions=28&selectedPageVersions=27

On Sun, Sep 27, 2020 at 9:02 PM Guozhang Wang <wangg...@gmail.com> wrote:
>
> Hello Jose,
>
> Thanks for the KIP. Overall it looks great. I have a few meta / minor
> questions, or maybe just clarifications below:
>
> Meta:
>
> 1. I want to clarify that if only the active controller would generate
> snapshots, OR would any voter in the quorum generate snapshots, OR
> would even observers generate snapshots? Originally I thought it was the
> latter case, but I think reading through the doc I got confused by some
> paragraphs. E.g. you mentioned snapshots are generated by the Controller
> module, and observers would not have that module.

Sorry for the confusion and inconsistency here. Every replica of the
cluster metadata topic partition will generate a snapshot. That
includes the voters (leader and followers) and observers. In this KIP
the leader is the Active Controller, the voters are the Kafka
Controllers and the observers are the Metadata Cache.

I went through the KIP again and made sure to enumerate both Kafka
Controllers and Metadata Cache when talking about snapshot generation
and loading.

I renamed the new configurations to be prefixed by metadata instead of
controller.

I moved the terminology section to the top.

>
> 2. Following on Jun's previous comment: currently the __consumer_metadata
> log is replicated on ALL brokers since all voters and observers would
> replicate that topic. I know this may be out of the scope of this KIP but I
> think maybe only letting the voters to replicate (and periodically
> truncate) the log while observers only maintain the in-memory state and
> snapshots is a good trade-off here, assuming snapshot loading is relatively
> fast.

This is a good idea and optimization; it would save a write. I think
we need to consider the implications for KIP-642, the dynamic quorum
reassignment KIP, if we end up allowing observers to get "promoted" to
voters.

>
> 3. When a raft client is in the middle of loading a snapshot, should it
> reject any vote / begin-/end-/describe-quorum requests at the time? More
> generally, while a snapshot is being loaded, how should we treat the
> current state of the client when handling Raft requests.

Re: requesting votes and granting votes.

The "Changes to Leader Election" section has been improved since your
review. It now says that during an election the raft client needs to
look at:

1. latest/largest snapshot epoch and end offset
2. the LEO of the replicated log

The voters should use whichever of these two is larger when requesting
and granting votes; see the sketch below.
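
To make that comparison concrete, here is a minimal Java sketch of how
a voter might combine the two. The names (OffsetAndEpoch, grantVote,
etc.) are hypothetical and this is not the actual raft client API:

final class OffsetAndEpoch implements Comparable<OffsetAndEpoch> {
    final long offset;
    final int epoch;

    OffsetAndEpoch(long offset, int epoch) {
        this.offset = offset;
        this.epoch = epoch;
    }

    @Override
    public int compareTo(OffsetAndEpoch that) {
        // Epoch takes precedence; offsets break ties within the same epoch.
        if (this.epoch != that.epoch) {
            return Integer.compare(this.epoch, that.epoch);
        }
        return Long.compare(this.offset, that.offset);
    }
}

final class VoteDecision {
    // The voter's "latest" position is the larger of the latest snapshot's
    // SnapshotId (end offset and epoch) and the LEO of the replicated log.
    static OffsetAndEpoch latestOf(OffsetAndEpoch latestSnapshotId,
                                   OffsetAndEpoch logEndOffset) {
        return latestSnapshotId.compareTo(logEndOffset) >= 0
            ? latestSnapshotId
            : logEndOffset;
    }

    // Grant the vote only if the candidate's last entry is at least as
    // up to date as this voter's latest position.
    static boolean grantVote(OffsetAndEpoch candidateLastEntry,
                             OffsetAndEpoch latestSnapshotId,
                             OffsetAndEpoch logEndOffset) {
        return candidateLastEntry.compareTo(
            latestOf(latestSnapshotId, logEndOffset)) >= 0;
    }
}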

Re: quorum state

For KIP-595 and KIP-630 the snapshot doesn't include any quorum
information. This may change in KIP-642.

>
> Minor:
>
> 4."All of the live replicas (followers and observers) have replicated LBO".
> Today the raft layer does not yet maintain LBO across all replicas, is this
> information kept in the controller layer? I'm asking because I do not see
> relevant docs in KIP-631 and hence a bit confused which layer is
> responsible for bookkeeping the LBOs of all replicas.

This is not minor! :). This should be done in the raft client as part
of the fetch protocol. Note that LBO is just a rename of the log start
offset. If the current raft implementation doesn't track this
information, then we will have to add it as part of this KIP (KIP-630).
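
For illustration, here is a rough Java sketch of how the leader could
bookkeep the offsets reported by replicas through Fetch requests; the
class and method names, and the exact advancement condition, are
assumptions on my part, not the existing implementation:

import java.util.Collection;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class ReplicaLboTracker {
    private final Map<Integer, Long> offsetByReplicaId = new ConcurrentHashMap<>();

    // Called when handling a Fetch request that carries the replica's
    // current position (its replicated log begin offset in this sketch).
    void onFetchRequest(int replicaId, long reportedOffset) {
        offsetByReplicaId.merge(replicaId, reportedOffset, Math::max);
    }

    // One possible check: the leader only advances its own LBO to a
    // snapshot's end offset once every live replica has reported
    // reaching at least that offset.
    boolean allLiveReplicasPast(Collection<Integer> liveReplicaIds,
                                long proposedLogBeginOffset) {
        return liveReplicaIds.stream()
            .allMatch(id -> offsetByReplicaId.getOrDefault(id, 0L) >= proposedLogBeginOffset);
    }
}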

> 5. "Followers and observers will increase their log begin offset to the
> value sent on the fetch response as long as the local Kafka Controller and
> Metadata Cache has generated a snapshot with an end offset greater than or
> equal to the new log begin offset." Not sure I follow this: 1) If observers
> do not generate snapshots since they do not have a Controller module on
> them, then it is possible that observers do not have any snapshots at all
> if they do not get one from the leader, in that case they would never
> truncate the logs locally;

Observers will have a Metadata Cache which will be responsible for
generating snapshots.

> 2) could you clarify on "value sent on the fetch
> response", are you referring to the "HighWatermark", or "LogStartOffset" in
> the schema, or some other fields?

The log begin offset is the same as the log start offset. This KIP
renames that field in the fetch response. I am starting to think that
renaming this field in this KIP is not worth it. What do you think?

>
> 6. The term "SnapshotId" is introduced without definition at first. My
> understanding is it's defined as a combo of <endOffset, epoch>, could you
> clarify if this is the case?

Good point. I added this sentence to the Snapshot Format and
Terminology sections:

"Each snapshot is uniquely identified by a SnapshotId, the epoch and
end offset of the records in the replicated log included in the
snapshot."

> BTW I think the term "endOffset" is a term
> used per log, and maybe calling the part of the SnapshotId "nextOffset" is
> better since that offset is likely already filled with a record.

I think using "end offset" is consistent with Kafka terminology. For
example, in KIP-595 we use (epoch, end offset) to describe the
DivergingEpoch.

> 7. This is a very nit one: "If the latest snapshot has an epoch E and end
> offset O and is it newer than the LEO of the replicated log, then the
> replica must set the LBO and LEO to O." On wiki `O` and `0` looks very much
> the same and that confused me a couple of times... I'd suggest we phrase
> any such occasions as "an epoch e1 and offset o1". Also for LEO since
> we would not really know what would be its epoch (since it may be bumped)
> when comparing we only care about the offset and not about the epoch right?
> If yes, please clarify that in the doc as well.

Agreed. Thanks for the suggestion. Done.

> 8. "LEO - log end offset - the largest offset and epoch that has been
> written to disk." I think LEO is the "next" offset to be written to the log
> right? Also it seems consistent with your diagrams.

Yes. Fixed the diagram and description. I double checked that the rest
of the KIP refers to LEO correctly.

> 9. "... will send a vote request and response as if they had an empty log."
> Not sure I completely follow this, do you mean that they will set
> "LastOffsetEpoch/LastOffset" as "-1/0" when sending a vote request, and
> upon receiving a vote request it would compare the
> request's "LastOffsetEpoch/LastOffset" with "-1/0" as well?

That paragraph contradicted the "Changes to Leader Election" section.
I improved that section but forgot to update this paragraph, so I have
simplified the paragraph to just refer to that section.

>
> 10. In the FetchSnapshot response schema, just to clarify the "Position" :
> "The byte position within the snapshot." is referring to the starting byte
> position of the returned snapshot data, right?

Yes. In Java land this will be the position argument used in
FileChannel.write(ByteBuffer, long).
https://docs.oracle.com/en/java/javase/15/docs/api/java.base/java/nio/channels/FileChannel.html#write(java.nio.ByteBuffer,long)

I improved the description for that "Position" field in the JSON
schema for the FetchSnapshot request and response.
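
For illustration only, here is a small Java sketch of how a follower
might use that Position when writing a fetched snapshot chunk to its
local copy; the class and method names are hypothetical:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class SnapshotChunkWriter {
    static void writeChunk(Path snapshotFile, ByteBuffer chunk, long position)
            throws IOException {
        try (FileChannel channel = FileChannel.open(
                snapshotFile,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE)) {
            // Position is the starting byte offset of this chunk within the snapshot.
            channel.write(chunk, position);
        }
    }
}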
