[
https://issues.apache.org/jira/browse/KAFKA-12793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
KahnCheny updated KAFKA-12793:
------------------------------
Attachment: KAFKA-12793.patch
> Client-side Circuit Breaker for Partition Write Errors
> ------------------------------------------------------
>
> Key: KAFKA-12793
> URL: https://issues.apache.org/jira/browse/KAFKA-12793
> Project: Kafka
> Issue Type: New Feature
> Components: clients
> Reporter: KahnCheny
> Priority: Major
> Attachments: KAFKA-12793.patch
>
>
> When Kafka is used to build data pipelines in mission-critical business
> scenarios, availability and throughput are the most important operational
> goals, and they must be maintained in the presence of transient or permanent
> local failures. One typical situation that requires Ops intervention is disk
> failure: some partitions have long write latency caused by extremely high
> disk utilization. Since all partitions share the same buffer under the
> current producer thread model, the buffer fills up quickly and eventually
> the healthy partitions are impacted as well. The cluster-level success rate
> and timeout ratio degrade until the local infrastructure issue is
> resolved.
> One way to mitigate this issue is to add a client-side mechanism that
> short-circuits problematic partitions during transient failures. Similar
> approaches are applied in other distributed systems and RPC frameworks.
> We propose adding a configuration-driven circuit-breaking mechanism that
> allows the Kafka client to ‘mute’ partitions when certain conditions are
> met. The mechanism adds callbacks to the Sender class workflow that allow
> partitions to be filtered based on a given policy.
> The client can choose an implementation that fits a specific failure
> scenario by providing custom implementations of Partitioner and
> ProducerInterceptor:
> * Customize the implementation of ProducerInterceptor, and choose the
> strategy for muting partitions.
> * Customize the implementation of Partitioner, and choose the strategy for
> filtering partitions.
> Muting partitions has an impact when the topic contains keyed messages, as
> messages with the same key may be written to more than one partition during
> the recovery period. We believe this can be an explicit trade-off the
> application makes between availability and message ordering.
> KIP-693:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-693%3A+Client-side+Circuit+Breaker+for+Partition+Write+Errors
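For illustration only, the muting and filtering strategies described in the issue above might be sketched as follows. This is not the attached patch and not the KIP-693 design: PartitionCircuitBreaker, its thresholds, and the plain integer key hash are all hypothetical, and the class deliberately has no dependency on kafka-clients. In a real client, record() would presumably be driven from ProducerInterceptor.onAcknowledgement and choosePartition() from Partitioner.partition.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical per-partition circuit breaker; not part of the Kafka client API. */
final class PartitionCircuitBreaker {
    private final int failureThreshold; // consecutive failures before a partition is muted
    private final long muteMillis;      // how long a muted partition is skipped
    private final Map<Integer, Integer> consecutiveFailures = new ConcurrentHashMap<>();
    private final Map<Integer, Long> mutedUntil = new ConcurrentHashMap<>();

    PartitionCircuitBreaker(int failureThreshold, long muteMillis) {
        this.failureThreshold = failureThreshold;
        this.muteMillis = muteMillis;
    }

    /** Interceptor side: record the outcome of one send to the given partition. */
    void record(int partition, boolean success, long nowMs) {
        if (success) {
            consecutiveFailures.remove(partition); // a success resets the failure streak
            return;
        }
        int failures = consecutiveFailures.merge(partition, 1, Integer::sum);
        if (failures >= failureThreshold) {
            mutedUntil.put(partition, nowMs + muteMillis); // open the circuit
            consecutiveFailures.remove(partition);
        }
    }

    /** Returns true while the partition is muted; the mute expires on its own. */
    boolean isMuted(int partition, long nowMs) {
        Long until = mutedUntil.get(partition);
        if (until == null) {
            return false;
        }
        if (nowMs >= until) {
            mutedUntil.remove(partition); // mute window elapsed, allow traffic again
            return false;
        }
        return true;
    }

    /** Partitioner side: start at the key's usual partition and skip muted ones. */
    int choosePartition(int keyHash, int numPartitions, long nowMs) {
        int preferred = Math.floorMod(keyHash, numPartitions);
        for (int i = 0; i < numPartitions; i++) {
            int candidate = (preferred + i) % numPartitions;
            if (!isMuted(candidate, nowMs)) {
                return candidate;
            }
        }
        return preferred; // every partition is muted: fall back to the usual one
    }
}
```

With a threshold of 2 and a 1000 ms mute window, two consecutive failures on partition 1 cause a key that normally maps to partition 1 to be routed to partition 2 until the window expires. That rerouting is the ordering trade-off the description mentions for keyed messages.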
--
This message was sent by Atlassian Jira
(v8.3.4#803005)