Thanks for the feedback José! Did some work to apply the KIP to a correct
format.

The motivation section starts with the solution without motivating the
> problem. I think you want to delete the first paragraph in the
> motivation section.


Done.

In the motivation section you have "non-terminating loop where they
> cannot join the cluster." They are technically not joining the
> cluster. The issue is a liveness issue. Because the FETCH_SNAPSHOT RPC
> doesn't complete within the fetch timeout configuration the fetching
> replica cannot make progress. In the worst case, if it is a
> voter/controller, it further disrupts the leader by becoming a
> candidate and increasing the epoch.


Updated the motivation to reflect this.



>  In the public interface section, please make it clear that the changes
> are the addition of two new configurations. Let's also add descriptions for
> these configurations.


Added descriptions.

Please fill out "Compatibility, Deprecation, and Migration Plan". Why
> is this safe and backward compatible?
>

Done.

In the test section you have "We can perform once off tests which
> evaluate the effect of degraded networking on snapshot. It's better to
> design system-tests which focus on correctness by using pathologically
> low values for controller.quorum.fetch.snapshot.max.bytes and
> controller.quorum.fetch.max.bytes. An example scenario would be a
> controller joining a quorum with a snapshot with a known size and then
> attempting to fetch the snapshot in small (say 64kb) or even
> pathologically small values like 1 byte." I don't fully understand
> this recommendation but if my understanding is correct this sounds
> very complicated and flakey to implement. Can't we just add protocol
> tests to KafkaRaftClient*Test that show these configurations being
> used when handling FETC and FETCH_SNAPSHOT requests?


Good point; that section is too specific. I've left it a bit more vague,
unit tests are sufficient provided we can "mock" a situation where we
retrieve less data than is available.

Best,
Jonah

On Mon, Sep 22, 2025 at 12:31 PM José Armando García Sancio
<[email protected]> wrote:

> Hi Jonah,
>
> Thanks for the changes. The proposed solution looks good to me at a
> high-level. I just have some minor comments.
>
> The motivation section starts with the solution without motivating the
> problem. I think you want to delete the first paragraph in the
> motivation section.
>
> In the motivation section you have "non-terminating loop where they
> cannot join the cluster." They are technically not joining the
> cluster. The issue is a liveness issue. Because the FETCH_SNAPSHOT RPC
> doesn't complete within the fetch timeout configuration the fetching
> replica cannot make progress. In the worst case, if it is a
> voter/controller, it further disrupts the leader by becoming a
> candidate and increasing the epoch.
>
> In the motivation section you have "The proposed default
> configurations tradeoff a reduction in the throughput of data
> transmitted by Fetch and FetchSnapshot requests with a liveness and
> correctness improvement." This is also the solution and not part of
> the motivation.
>
> In the public interface section, please make it clear that the changes
> are the addition of two new configurations. Let's also add
> descriptions for these configurations.
>
> Please fill out "Compatibility, Deprecation, and Migration Plan". Why
> is this safe and backward compatible?
>
> In the test section you have "We can perform once off tests which
> evaluate the effect of degraded networking on snapshot. It's better to
> design system-tests which focus on correctness by using pathologically
> low values for controller.quorum.fetch.snapshot.max.bytes and
> controller.quorum.fetch.max.bytes. An example scenario would be a
> controller joining a quorum with a snapshot with a known size and then
> attempting to fetch the snapshot in small (say 64kb) or even
> pathologically small values like 1 byte." I don't fully understand
> this recommendation but if my understanding is correct this sounds
> very complicated and flakey to implement. Can't we just add protocol
> tests to KafkaRaftClient*Test that show these configurations being
> used when handling FETC and FETCH_SNAPSHOT requests?
>
> I didn't comment on the proposed changes section. I'll comment after
> the comments above are addressed.
>
> Thanks
> --
> -José
>

Reply via email to