Hi Jonah, Thanks for the changes. The proposed solution looks good to me at a high-level. I just have some minor comments.
The motivation section starts with the solution without motivating the problem. I think you want to delete the first paragraph in the motivation section. In the motivation section you have "non-terminating loop where they cannot join the cluster." They are technically not joining the cluster. The issue is a liveness issue. Because the FETCH_SNAPSHOT RPC doesn't complete within the fetch timeout configuration the fetching replica cannot make progress. In the worst case, if it is a voter/controller, it further disrupts the leader by becoming a candidate and increasing the epoch. In the motivation section you have "The proposed default configurations tradeoff a reduction in the throughput of data transmitted by Fetch and FetchSnapshot requests with a liveness and correctness improvement." This is also the solution and not part of the motivation. In the public interface section, please make it clear that the changes are the addition of two new configurations. Let's also add descriptions for these configurations. Please fill out "Compatibility, Deprecation, and Migration Plan". Why is this safe and backward compatible? In the test section you have "We can perform once off tests which evaluate the effect of degraded networking on snapshot. It's better to design system-tests which focus on correctness by using pathologically low values for controller.quorum.fetch.snapshot.max.bytes and controller.quorum.fetch.max.bytes. An example scenario would be a controller joining a quorum with a snapshot with a known size and then attempting to fetch the snapshot in small (say 64kb) or even pathologically small values like 1 byte." I don't fully understand this recommendation but if my understanding is correct this sounds very complicated and flakey to implement. Can't we just add protocol tests to KafkaRaftClient*Test that show these configurations being used when handling FETC and FETCH_SNAPSHOT requests? I didn't comment on the proposed changes section. I'll comment after the comments above are addressed. Thanks -- -José
