Hi Jonah,

Thanks for the changes. The proposed solution looks good to me at a
high-level. I just have some minor comments.

The motivation section starts with the solution without motivating the
problem. I think you want to delete the first paragraph in the
motivation section.

In the motivation section you have "non-terminating loop where they
cannot join the cluster." They are technically not joining the
cluster. The issue is a liveness issue. Because the FETCH_SNAPSHOT RPC
doesn't complete within the fetch timeout configuration the fetching
replica cannot make progress. In the worst case, if it is a
voter/controller, it further disrupts the leader by becoming a
candidate and increasing the epoch.

In the motivation section you have "The proposed default
configurations tradeoff a reduction in the throughput of data
transmitted by Fetch and FetchSnapshot requests with a liveness and
correctness improvement." This is also the solution and not part of
the motivation.

In the public interface section, please make it clear that the changes
are the addition of two new configurations. Let's also add
descriptions for these configurations.

Please fill out "Compatibility, Deprecation, and Migration Plan". Why
is this safe and backward compatible?

In the test section you have "We can perform once off tests which
evaluate the effect of degraded networking on snapshot. It's better to
design system-tests which focus on correctness by using pathologically
low values for controller.quorum.fetch.snapshot.max.bytes and
controller.quorum.fetch.max.bytes. An example scenario would be a
controller joining a quorum with a snapshot with a known size and then
attempting to fetch the snapshot in small (say 64kb) or even
pathologically small values like 1 byte." I don't fully understand
this recommendation but if my understanding is correct this sounds
very complicated and flakey to implement. Can't we just add protocol
tests to KafkaRaftClient*Test that show these configurations being
used when handling FETC and FETCH_SNAPSHOT requests?

I didn't comment on the proposed changes section. I'll comment after
the comments above are addressed.

Thanks
-- 
-José

Reply via email to