[ https://issues.apache.org/jira/browse/KAFKA-12793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

KahnCheny updated KAFKA-12793:
------------------------------
    Attachment:     (was: KAFKA-12793__KIP-693,_Client-side_supports_partitioned_circuit_breaker.patch)

> Client-side Circuit Breaker for Partition Write Errors
> ------------------------------------------------------
>
>                 Key: KAFKA-12793
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12793
>             Project: Kafka
>          Issue Type: New Feature
>          Components: clients
>            Reporter: KahnCheny
>            Priority: Major
>
> When Kafka is used to build data pipelines in mission-critical business
> scenarios, availability and throughput are the most important operational
> goals, and they must be maintained in the presence of transient or
> permanent local failures. One typical situation that requires Ops
> intervention is disk failure: some partitions suffer long write latency
> caused by extremely high disk utilization. Because all partitions share the
> same buffer under the current producer thread model, the buffer fills up
> quickly and eventually the healthy partitions are impacted as well. The
> cluster-level success rate and timeout ratio degrade until the local
> infrastructure issue is resolved.
> One way to mitigate this issue is to add a client-side mechanism that
> short-circuits problematic partitions during a transient failure. A similar
> approach is applied in other distributed systems and RPC frameworks.
> We propose adding a configuration-driven circuit-breaking mechanism that
> allows the Kafka client to 'mute' partitions when a certain condition is
> met. The mechanism adds callbacks to the Sender class workflow that allow
> partitions to be filtered according to a given policy.
> The client can choose an implementation that fits a particular failure
> scenario by providing custom client-side implementations of Partitioner and
> ProducerInterceptor:
> * Customize the implementation of ProducerInterceptor, and choose the
> strategy used to mute partitions.
> * Customize the implementation of Partitioner, and choose the strategy used
> to filter partitions.
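
To make the muting and filtering idea above concrete, here is a minimal standalone sketch in Java. It is not the attached patch and not part of kafka-clients: the class name PartitionCircuitBreaker, its methods, and the policy (mute after N consecutive failures for a fixed cool-down) are all assumptions chosen for illustration. In a real producer, record() would presumably be driven from an interceptor or send callback, and choosePartition() from a custom Partitioner.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical per-partition circuit breaker (illustration only).
 * Assumed policy: mute a partition after a number of consecutive
 * write failures, and unmute it once a fixed cool-down has elapsed.
 */
final class PartitionCircuitBreaker {
    private final int failureThreshold; // consecutive failures before muting
    private final long muteMillis;      // how long a muted partition is skipped
    private final Map<Integer, Integer> consecutiveFailures = new HashMap<>();
    private final Map<Integer, Long> mutedUntil = new HashMap<>();

    PartitionCircuitBreaker(int failureThreshold, long muteMillis) {
        if (failureThreshold < 1 || muteMillis < 0) {
            throw new IllegalArgumentException("invalid circuit breaker settings");
        }
        this.failureThreshold = failureThreshold;
        this.muteMillis = muteMillis;
    }

    /** Records the outcome of one write to a partition. */
    synchronized void record(int partition, boolean success, long nowMillis) {
        if (success) {
            consecutiveFailures.remove(partition); // a success resets the streak
            return;
        }
        int failures = consecutiveFailures.merge(partition, 1, Integer::sum);
        if (failures >= failureThreshold) {
            mutedUntil.put(partition, nowMillis + muteMillis);
            consecutiveFailures.remove(partition);
        }
    }

    /** Returns true while the partition is inside its cool-down window. */
    synchronized boolean isMuted(int partition, long nowMillis) {
        Long until = mutedUntil.get(partition);
        if (until == null) {
            return false;
        }
        if (nowMillis >= until) {
            mutedUntil.remove(partition); // cool-down elapsed: close the circuit
            return false;
        }
        return true;
    }

    /**
     * Picks a partition for a key hash, skipping muted partitions.
     * While the preferred partition is muted, the key is remapped, so
     * per-key ordering is not preserved during recovery.
     */
    synchronized int choosePartition(int keyHash, int numPartitions, long nowMillis) {
        if (numPartitions < 1) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        int preferred = Math.floorMod(keyHash, numPartitions);
        if (!isMuted(preferred, nowMillis)) {
            return preferred;
        }
        List<Integer> available = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) {
            if (!isMuted(p, nowMillis)) {
                available.add(p);
            }
        }
        if (available.isEmpty()) {
            return preferred; // everything is muted: fall back to the default choice
        }
        return available.get(Math.floorMod(keyHash, available.size()));
    }
}
```

Passing the clock in as nowMillis keeps the sketch deterministic and easy to test. A production version would likely read the threshold and cool-down from producer configuration, which is only assumed here.
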
> Muting partitions has an impact when the topic contains keyed messages,
> because messages with the same key may be written to more than one
> partition during the recovery period. We believe this can be an explicit
> trade-off that the application makes between availability and message
> ordering.
> KIP-693:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-693%3A+Client-side+Circuit+Breaker+for+Partition+Write+Errors



--
This message was sent by Atlassian Jira
(v8.3.4#803005)