[ 
https://issues.apache.org/jira/browse/KAFKA-12793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KahnCheny updated KAFKA-12793:
------------------------------
    Description: 
When Kafka is used to build data pipelines in mission-critical business 
scenarios, availability and throughput are the most important operational goals, 
and they need to be maintained in the presence of transient or permanent local 
failures. One typical situation that requires Ops intervention is disk failure: 
some partitions see long write latency caused by extremely high disk 
utilization. Because all partitions share the same buffer under the current 
producer thread model, the buffer fills up quickly and eventually the healthy 
partitions are impacted as well. The cluster-level success rate and timeout 
ratio degrade until the local infrastructure issue is resolved.

One way to mitigate this issue is to add a client-side mechanism that 
short-circuits problematic partitions during transient failures. A similar 
approach is applied in other distributed systems and RPC frameworks.



  was:
When Kafka is used to build data pipelines in mission-critical business 
scenarios, availability and throughput are the most important operational goals, 
and they need to be maintained in the presence of transient or permanent local 
failures. One typical situation that requires Ops intervention is disk failure: 
some partitions see long write latency caused by extremely high disk 
utilization. Because all partitions share the same buffer under the current 
producer thread model, the buffer fills up quickly and eventually the healthy 
partitions are impacted as well. The cluster-level success rate and timeout 
ratio degrade until the local infrastructure issue is resolved.

One way to mitigate this issue is to add a client-side mechanism that 
short-circuits problematic partitions during transient failures. A similar 
approach is applied in other distributed systems and RPC frameworks.

 

[link title|http://example.com]


> Client-side Circuit Breaker for Partition Write Errors
> ------------------------------------------------------
>
>                 Key: KAFKA-12793
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12793
>             Project: Kafka
>          Issue Type: New Feature
>          Components: clients
>            Reporter: KahnCheny
>            Priority: Major
>
> When Kafka is used to build data pipelines in mission-critical business 
> scenarios, availability and throughput are the most important operational 
> goals, and they need to be maintained in the presence of transient or 
> permanent local failures. One typical situation that requires Ops 
> intervention is disk failure: some partitions see long write latency caused 
> by extremely high disk utilization. Because all partitions share the same 
> buffer under the current producer thread model, the buffer fills up quickly 
> and eventually the healthy partitions are impacted as well. The cluster-level 
> success rate and timeout ratio degrade until the local infrastructure issue 
> is resolved.
> One way to mitigate this issue is to add a client-side mechanism that 
> short-circuits problematic partitions during transient failures. A similar 
> approach is applied in other distributed systems and RPC frameworks.
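
To make the proposed short-circuit idea concrete, here is a minimal sketch of 
what a per-partition circuit breaker could look like on the client side. It is 
an illustration only: the class PartitionCircuitBreaker, its methods, and the 
threshold/cool-down parameters are all hypothetical and are not part of the 
Kafka client API or of any patch attached to this issue. The clock is injected 
so the behaviour can be exercised without sleeping.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical per-partition circuit breaker; NOT part of the Kafka client API. */
class PartitionCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    /** Mutable per-partition bookkeeping, guarded by synchronizing on the entry. */
    private static final class Entry {
        State state = State.CLOSED;
        int consecutiveFailures = 0;
        long openedAtMs = 0L;
    }

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();
    private final int failureThreshold;   // assumed knob: failures before opening
    private final long openDurationMs;    // assumed knob: cool-down before a probe
    private final LongSupplier clockMs;   // injected clock, e.g. System::currentTimeMillis

    PartitionCircuitBreaker(int failureThreshold, long openDurationMs, LongSupplier clockMs) {
        this.failureThreshold = failureThreshold;
        this.openDurationMs = openDurationMs;
        this.clockMs = clockMs;
    }

    private Entry entry(String topic, int partition) {
        return entries.computeIfAbsent(topic + "-" + partition, k -> new Entry());
    }

    /** Returns true if a record may be sent to this partition right now. */
    boolean allowSend(String topic, int partition) {
        Entry e = entry(topic, partition);
        synchronized (e) {
            if (e.state == State.OPEN
                    && clockMs.getAsLong() - e.openedAtMs >= openDurationMs) {
                e.state = State.HALF_OPEN; // cool-down elapsed: let one probe through
                return true;
            }
            // CLOSED allows sends; OPEN and HALF_OPEN (probe in flight) reject them.
            return e.state == State.CLOSED;
        }
    }

    /** Call when a send to the partition is acknowledged. */
    void recordSuccess(String topic, int partition) {
        Entry e = entry(topic, partition);
        synchronized (e) {
            e.state = State.CLOSED;
            e.consecutiveFailures = 0;
        }
    }

    /** Call when a send to the partition fails or times out. */
    void recordFailure(String topic, int partition) {
        Entry e = entry(topic, partition);
        synchronized (e) {
            e.consecutiveFailures++;
            // A failed probe reopens immediately; otherwise open at the threshold.
            if (e.state == State.HALF_OPEN || e.consecutiveFailures >= failureThreshold) {
                e.state = State.OPEN;
                e.openedAtMs = clockMs.getAsLong();
            }
        }
    }

    State state(String topic, int partition) {
        Entry e = entry(topic, partition);
        synchronized (e) {
            return e.state;
        }
    }
}
```

One possible way to wire this in, again as an assumption rather than a 
description of any existing implementation: consult allowSend() before handing 
a record to the producer (or inside a custom Partitioner to steer records 
toward healthy partitions), and call recordSuccess()/recordFailure() from the 
producer send Callback depending on whether the exception argument is null. 
Because state is kept per partition, a slow partition stops consuming shared 
buffer space while the healthy ones keep flowing.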



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
