Hi Kevin, thanks for reading and responding to the KIP!

>  What is the motivation for lowering this default limit from 8MiB to 1MiB?


I think a 1MiB default is more appropriate for most users. The new
configuration slows things down (slightly) in a way which does not affect
availability but is *correct* in more cases, which makes it the better
default. Keeping the 8MiB value means users are more likely to run into
the liveness problem and have to take corrective action themselves, and
debugging it requires knowledge of the KRaft protocol. Further, it is
already possible to mitigate the problem by increasing the leader election
timeout. The intent of setting the default to 1MiB is to make the scenario
less likely to occur, given that we probably don't lose "too much" in
terms of performance.
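
For concreteness, a rough sketch of the knobs involved (the two *.max.bytes
names are the ones proposed in this KIP, shown with the proposed 1MiB
default; the timeout value below is purely illustrative, not a
recommendation):

  # Proposed in this KIP (1MiB = 1048576 bytes)
  controller.quorum.fetch.max.bytes=1048576
  controller.quorum.fetch.snapshot.max.bytes=1048576

  # Existing mitigation: give slow fetches more time before a follower
  # gives up and disrupts the leader with an election
  controller.quorum.fetch.timeout.ms=4000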

Metadata snapshots are typically not larger than 5 MiB; the largest I have
seen was ~= 40 MiB. Further, any slowdown would come from the underlying
TCP connection not being utilized as fully as before, since a single
request can no longer carry up to 8MiB of data in flight at a given time.
Changing the configuration from 8MiB to 1MiB does not mean that an
operation which used to take 1 second will now take 8 seconds. Multiple
1MiB requests can be sent per second, especially on a high performance
network, and the same underlying TCP connection is reused throughout,
which means handshakes and transmission windows don't need to be
recomputed each time a request is sent.
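
As a rough back-of-envelope (the round-trip figure here is illustrative,
not measured):

  40 MiB snapshot / 1 MiB per FetchSnapshot request  ~= 40 requests
  40 requests x ~10 ms round trip, fully serialized  ~= 0.4 s added latency

Even the largest snapshot I have seen stays well within a multi-second
fetch timeout, since the follower can issue the next request as soon as the
previous one completes over the same warm connection.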

Best,
Jonah


On Wed, Sep 24, 2025 at 1:04 PM Kevin Wu <[email protected]> wrote:

> Hi Jonah,
>
> Thanks for this KIP! Being able to configure this would definitely help
> protect users with slower networks from unnecessary, disruptive elections. I have a
> question about the new default you're proposing.
>
> > the behavioural change in this KIP primarily reducing the (current) 8MiB
> > limit to 1MiB (specified in new configurations) *by default*
> What is the motivation for lowering this default limit from 8MiB to 1MiB?
> You state that this new default value "will likely make the fetches
> slightly slower for deployments with high speed networks." Isn't that an
> argument for preserving the original default? My opinion is that the user
> should be the one setting this config to a lower value if they need it to
> be lower, not "us" via the default. This way we don't run the possibility
> of impacting existing kafkas. This means the user needs to be aware of
> KAFKA-19541 when setting up their cluster, so maybe we should just explain
> this configuration clearly in the docs, so the user can make sure things
> are configured properly.
>
> Best,
> Kevin Wu
>
> On Mon, Sep 22, 2025 at 11:31 AM José Armando García Sancio
> <[email protected]> wrote:
>
> > Hi Jonah,
> >
> > Thanks for the changes. The proposed solution looks good to me at a
> > high-level. I just have some minor comments.
> >
> > The motivation section starts with the solution without motivating the
> > problem. I think you want to delete the first paragraph in the
> > motivation section.
> >
> > In the motivation section you have "non-terminating loop where they
> > cannot join the cluster." They are technically not joining the
> > cluster. The issue is a liveness issue. Because the FETCH_SNAPSHOT RPC
> > doesn't complete within the fetch timeout configuration the fetching
> > replica cannot make progress. In the worst case, if it is a
> > voter/controller, it further disrupts the leader by becoming a
> > candidate and increasing the epoch.
> >
> > In the motivation section you have "The proposed default
> > configurations tradeoff a reduction in the throughput of data
> > transmitted by Fetch and FetchSnapshot requests with a liveness and
> > correctness improvement." This is also the solution and not part of
> > the motivation.
> >
> > In the public interface section, please make it clear that the changes
> > are the addition of two new configurations. Let's also add
> > descriptions for these configurations.
> >
> > Please fill out "Compatibility, Deprecation, and Migration Plan". Why
> > is this safe and backward compatible?
> >
> > In the test section you have "We can perform once off tests which
> > evaluate the effect of degraded networking on snapshot. It's better to
> > design system-tests which focus on correctness by using pathologically
> > low values for controller.quorum.fetch.snapshot.max.bytes and
> > controller.quorum.fetch.max.bytes. An example scenario would be a
> > controller joining a quorum with a snapshot with a known size and then
> > attempting to fetch the snapshot in small (say 64kb) or even
> > pathologically small values like 1 byte." I don't fully understand
> > this recommendation but if my understanding is correct this sounds
> > very complicated and flaky to implement. Can't we just add protocol
> > tests to KafkaRaftClient*Test that show these configurations being
> > used when handling FETCH and FETCH_SNAPSHOT requests?
> >
> > I didn't comment on the proposed changes section. I'll comment after
> > the comments above are addressed.
> >
> > Thanks
> > --
> > -José
> >
>
