Thanks for the feedback José! Did some work to apply the KIP to a correct format.
The motivation section starts with the solution without motivating the > problem. I think you want to delete the first paragraph in the > motivation section. Done. In the motivation section you have "non-terminating loop where they > cannot join the cluster." They are technically not joining the > cluster. The issue is a liveness issue. Because the FETCH_SNAPSHOT RPC > doesn't complete within the fetch timeout configuration the fetching > replica cannot make progress. In the worst case, if it is a > voter/controller, it further disrupts the leader by becoming a > candidate and increasing the epoch. Updated the motivation to reflect this. > In the public interface section, please make it clear that the changes > are the addition of two new configurations. Let's also add descriptions for > these configurations. Added descriptions. Please fill out "Compatibility, Deprecation, and Migration Plan". Why > is this safe and backward compatible? > Done. In the test section you have "We can perform once off tests which > evaluate the effect of degraded networking on snapshot. It's better to > design system-tests which focus on correctness by using pathologically > low values for controller.quorum.fetch.snapshot.max.bytes and > controller.quorum.fetch.max.bytes. An example scenario would be a > controller joining a quorum with a snapshot with a known size and then > attempting to fetch the snapshot in small (say 64kb) or even > pathologically small values like 1 byte." I don't fully understand > this recommendation but if my understanding is correct this sounds > very complicated and flakey to implement. Can't we just add protocol > tests to KafkaRaftClient*Test that show these configurations being > used when handling FETC and FETCH_SNAPSHOT requests? Good point; that section is too specific. I've left it a bit more vague, unit tests are sufficient provided we can "mock" a situation where we retrieve less data than is available. Best, Jonah On Mon, Sep 22, 2025 at 12:31 PM José Armando García Sancio <[email protected]> wrote: > Hi Jonah, > > Thanks for the changes. The proposed solution looks good to me at a > high-level. I just have some minor comments. > > The motivation section starts with the solution without motivating the > problem. I think you want to delete the first paragraph in the > motivation section. > > In the motivation section you have "non-terminating loop where they > cannot join the cluster." They are technically not joining the > cluster. The issue is a liveness issue. Because the FETCH_SNAPSHOT RPC > doesn't complete within the fetch timeout configuration the fetching > replica cannot make progress. In the worst case, if it is a > voter/controller, it further disrupts the leader by becoming a > candidate and increasing the epoch. > > In the motivation section you have "The proposed default > configurations tradeoff a reduction in the throughput of data > transmitted by Fetch and FetchSnapshot requests with a liveness and > correctness improvement." This is also the solution and not part of > the motivation. > > In the public interface section, please make it clear that the changes > are the addition of two new configurations. Let's also add > descriptions for these configurations. > > Please fill out "Compatibility, Deprecation, and Migration Plan". Why > is this safe and backward compatible? > > In the test section you have "We can perform once off tests which > evaluate the effect of degraded networking on snapshot. It's better to > design system-tests which focus on correctness by using pathologically > low values for controller.quorum.fetch.snapshot.max.bytes and > controller.quorum.fetch.max.bytes. An example scenario would be a > controller joining a quorum with a snapshot with a known size and then > attempting to fetch the snapshot in small (say 64kb) or even > pathologically small values like 1 byte." I don't fully understand > this recommendation but if my understanding is correct this sounds > very complicated and flakey to implement. Can't we just add protocol > tests to KafkaRaftClient*Test that show these configurations being > used when handling FETC and FETCH_SNAPSHOT requests? > > I didn't comment on the proposed changes section. I'll comment after > the comments above are addressed. > > Thanks > -- > -José >
