Thanks for the changes, Jonah. Looking forward to the implementation of this feature.
On Fri, Sep 26, 2025 at 12:33 PM Jonah Hooper <[email protected]> wrote:
>
> Thanks for the feedback José! Did some work to bring the KIP into the
> correct format.
>
> > The motivation section starts with the solution without motivating the
> > problem. I think you want to delete the first paragraph in the
> > motivation section.
>
> Done.
>
> > In the motivation section you have "non-terminating loop where they
> > cannot join the cluster." They are technically not joining the
> > cluster. The issue is a liveness issue. Because the FETCH_SNAPSHOT RPC
> > doesn't complete within the fetch timeout configuration the fetching
> > replica cannot make progress. In the worst case, if it is a
> > voter/controller, it further disrupts the leader by becoming a
> > candidate and increasing the epoch.
>
> Updated the motivation to reflect this.
>
> > In the public interface section, please make it clear that the changes
> > are the addition of two new configurations. Let's also add descriptions
> > for these configurations.
>
> Added descriptions.
>
> > Please fill out "Compatibility, Deprecation, and Migration Plan". Why
> > is this safe and backward compatible?
>
> Done.
>
> > In the test section you have "We can perform once off tests which
> > evaluate the effect of degraded networking on snapshot. It's better to
> > design system-tests which focus on correctness by using pathologically
> > low values for controller.quorum.fetch.snapshot.max.bytes and
> > controller.quorum.fetch.max.bytes. An example scenario would be a
> > controller joining a quorum with a snapshot with a known size and then
> > attempting to fetch the snapshot in small (say 64kb) or even
> > pathologically small values like 1 byte." I don't fully understand
> > this recommendation but if my understanding is correct this sounds
> > very complicated and flaky to implement. Can't we just add protocol
> > tests to KafkaRaftClient*Test that show these configurations being
> > used when handling FETCH and FETCH_SNAPSHOT requests?
>
> Good point; that section is too specific. I've left it a bit more vague:
> unit tests are sufficient provided we can "mock" a situation where we
> retrieve less data than is available.
>
> Best,
> Jonah
>
> On Mon, Sep 22, 2025 at 12:31 PM José Armando García Sancio
> <[email protected]> wrote:
> >
> > Hi Jonah,
> >
> > Thanks for the changes. The proposed solution looks good to me at a
> > high level. I just have some minor comments.
> >
> > The motivation section starts with the solution without motivating the
> > problem. I think you want to delete the first paragraph in the
> > motivation section.
> >
> > In the motivation section you have "non-terminating loop where they
> > cannot join the cluster." They are technically not joining the
> > cluster. The issue is a liveness issue. Because the FETCH_SNAPSHOT RPC
> > doesn't complete within the fetch timeout configuration the fetching
> > replica cannot make progress. In the worst case, if it is a
> > voter/controller, it further disrupts the leader by becoming a
> > candidate and increasing the epoch.
> >
> > In the motivation section you have "The proposed default
> > configurations tradeoff a reduction in the throughput of data
> > transmitted by Fetch and FetchSnapshot requests with a liveness and
> > correctness improvement." This is also the solution and not part of
> > the motivation.
> >
> > In the public interface section, please make it clear that the changes
> > are the addition of two new configurations. Let's also add
> > descriptions for these configurations.
> >
> > Please fill out "Compatibility, Deprecation, and Migration Plan". Why
> > is this safe and backward compatible?
> >
> > In the test section you have "We can perform once off tests which
> > evaluate the effect of degraded networking on snapshot. It's better to
> > design system-tests which focus on correctness by using pathologically
> > low values for controller.quorum.fetch.snapshot.max.bytes and
> > controller.quorum.fetch.max.bytes. An example scenario would be a
> > controller joining a quorum with a snapshot with a known size and then
> > attempting to fetch the snapshot in small (say 64kb) or even
> > pathologically small values like 1 byte." I don't fully understand
> > this recommendation but if my understanding is correct this sounds
> > very complicated and flaky to implement. Can't we just add protocol
> > tests to KafkaRaftClient*Test that show these configurations being
> > used when handling FETCH and FETCH_SNAPSHOT requests?
> >
> > I didn't comment on the proposed changes section. I'll comment after
> > the comments above are addressed.
> >
> > Thanks
> > --
> > -José
>
--
-José
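
Below is a minimal, self-contained sketch of the unit-test idea Jonah describes:
cap how many bytes a single FetchSnapshot-style response may return and check
that the fetcher still makes progress and terminates. The class and method names
are illustrative only, not the actual KafkaRaftClient*Test API, and the snapshot
size and byte cap are arbitrary values.

import java.util.ArrayList;
import java.util.List;

// Minimal sketch of "retrieve less data than is available" per response.
// Names are illustrative; this is not the KafkaRaftClient*Test API.
public final class ChunkedSnapshotFetchSketch {

    // Simulates one FetchSnapshot-style response: return at most maxBytes
    // starting at the given position, even though more data is available.
    static byte[] fetchSnapshotChunk(byte[] snapshot, int position, int maxBytes) {
        int size = Math.min(maxBytes, snapshot.length - position);
        byte[] chunk = new byte[size];
        System.arraycopy(snapshot, position, chunk, 0, size);
        return chunk;
    }

    public static void main(String[] args) {
        byte[] snapshot = new byte[10_000]; // snapshot of known size
        int maxBytes = 64;                  // pathologically small per-response cap

        List<byte[]> responses = new ArrayList<>();
        int position = 0;
        while (position < snapshot.length) {
            byte[] chunk = fetchSnapshotChunk(snapshot, position, maxBytes);
            if (chunk.length == 0 || chunk.length > maxBytes) {
                throw new AssertionError("bad response size: " + chunk.length);
            }
            responses.add(chunk);
            position += chunk.length;       // each small response still makes progress
        }

        // The transfer terminates in ceil(snapshotSize / maxBytes) round trips
        // instead of hanging on a single oversized response.
        int expectedRoundTrips = (snapshot.length + maxBytes - 1) / maxBytes;
        if (responses.size() != expectedRoundTrips) {
            throw new AssertionError("expected " + expectedRoundTrips
                    + " round trips, got " + responses.size());
        }
        System.out.println("completed in " + responses.size() + " round trips");
    }
}

In practice this invariant would be exercised through protocol tests in
KafkaRaftClient*Test while handling FETCH and FETCH_SNAPSHOT requests, as José
suggests; the sketch only shows what those tests need to assert: every response
respects the configured byte cap and the transfer still completes in a bounded
number of round trips.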
