Once again, thanks for the feedback, Jason.

My changes to the KIP are here:
https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=158864763&selectedPageVersions=18&selectedPageVersions=17

And see my comments below...

On Mon, Aug 3, 2020 at 1:57 PM Jason Gustafson <ja...@confluent.io> wrote:
>
> Hi Jose,
>
> Thanks for the proposal. I think there are three main motivations for
> snapshotting over the existing compaction semantics.
>
> First we are arguing that compaction is a poor semantic fit for how we want
> to model the metadata in the cluster. We are trying to view the changes in
> the cluster as a stream of events, not necessarily as a stream of key/value
> updates. The reason this is useful is that a single event may correspond to
> a set of key/value updates. We don't need to delete each partition
> individually for example if we are deleting the full topic. Outside of
> deletion, however, the benefits of this approach are less obvious. I am
> wondering if there are other cases where the event-based approach has some
> benefit?
>

Yes. Another example of this is what KIP-631 calls FenceBroker. In the
current implementation of the Kafka Controller, and in the
implementation proposed in KIP-631, whenever a broker is fenced the
controller removes that broker from the ISR of every partition it
hosts and performs leader election where necessary. A single
FenceBroker event therefore corresponds to many key/value updates.
The impact of this operation on replication is documented in the
section "Amount of Data Replicated". I have also updated the KIP to
reflect this.

> The second motivation is from the perspective of consistency. Basically we
> don't like the existing solution for the tombstone deletion problem, which
> is just to add a delay before removal. The case we are concerned about
> requires a replica to fetch up to a specific offset and then stall for a
> time which is longer than the deletion retention timeout. If this happens,
> then the replica might not see the tombstone, which would lead to an
> inconsistent state. I think we are already talking about a rare case, but I
> wonder if there are simple ways to tighten it further. For the sake of
> argument, what if we had the replica start over from the beginning whenever
> there is a replication delay which is longer than tombstone retention time?
> Just want to be sure we're not missing any simple/pragmatic solutions
> here...
>

We explore the changes that would be needed to log compaction and the
fetch protocol to get a consistent replicated log in the rejected
alternatives section. I changed the KIP to also mention this in the
motivation section by adding a subsection called "Consistent Log and
Tombstones".

> Finally, I think we are arguing that compaction gives a poor performance
> tradeoff when the state is already in memory. It requires us to read and
> replay all of the changes even though we already know the end result. One
> way to think about it is that compaction works O(the rate of changes) while
> snapshotting is O(the size of data). Contrarily, the nice thing about
> compaction is that it works irrespective of the size of the data, which
> makes it a better fit for user partitions. I feel like this might be an
> argument we can make empirically or at least with back-of-the-napkin
> calculations. If we assume a fixed size of data and a certain rate of
> change, then what are the respective costs of snapshotting vs compaction? I
> think compaction fares worse as the rate of change increases. In the case
> of __consumer_offsets, which sometimes has to support a very high rate of
> offset commits, I think snapshotting would be a great tradeoff to reduce
> load time on coordinator failover. The rate of change for metadata on the
> other hand might not be as high, though it can be very bursty.
>

This is a very good observation. If you assume that the number of keys
doesn't change but that their values are updated frequently, then
after log compaction the size of the compacted section of the log is
O(size of the data) + O(size of the tombstones). As you point out, the
size of the snapshot is also O(size of the data). I think this is a
reasonable assumption for topics like __cluster_metadata and
__consumer_offsets.
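
As a back-of-the-napkin illustration (made-up numbers, only to show
the asymptotics): with 1 million keys at roughly 100 bytes per record
and 10% of the keys deleted since the last cleaning,

    snapshot size       ~= 1,000,000 * 100 bytes  ~= 100 MB
    compacted log size  ~= 100 MB + ~100,000 tombstone records

Both are on the order of the size of the data, independent of how many
updates were applied in between.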

The difference is in the number of reads required. With in-memory
snapshots we only need to read the log once. With log compaction we
need to read the log 3 times: 1. to update the in-memory state, 2. to
generate the map from key to latest offset, and 3. to compact the log
using that map. I have updated the KIP and go into a lot more detail
in the section "Loading State and Frequency of Compaction".
-- 
-Jose
