[
https://issues.apache.org/jira/browse/KAFKA-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188593#comment-14188593
]
Kyle Banker commented on KAFKA-1736:
------------------------------------
[~gwenshap] For users trying to mitigate rack failures, I believe it's
actually easier to reason about those failures with broker-level replica
placement. For example, a 9-node cluster spread across three racks, with
replication factor 3 and broker-level replica placement, means I can lose
an entire rack and still be available. With random placement, no such
guarantee can be made.
Again, we're assuming the min.isr=2, replication-factor=3 case here.
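As a rough illustration (a sketch only; the rack layout, grouping, and names
below are assumptions for the example, not anything in the current assigner):

  // Sketch: 9 brokers, 3 racks, RF=3, min.isr=2. Build replica groups so
  // that each group has exactly one broker per rack; losing an entire rack
  // then removes at most one replica from any partition.
  val racks = Map(0 -> Seq(0, 1, 2), 1 -> Seq(3, 4, 5), 2 -> Seq(6, 7, 8))
  // Group i takes the i-th broker from every rack: {0,3,6}, {1,4,7}, {2,5,8}.
  val groups = (0 until 3).map(i => racks.values.map(_(i)).toSet)
  val lostRack = racks(1).toSet
  // Every replica group keeps >= min.isr = 2 live brokers after the rack is
  // gone, so writes with required.acks=all can still succeed.
  assert(groups.forall(g => (g -- lostRack).size >= 2))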
> Improve partition-broker assignment strategy for better availability in
> majority durability modes
> -----------------------------------------------------------------------------------------------
>
> Key: KAFKA-1736
> URL: https://issues.apache.org/jira/browse/KAFKA-1736
> Project: Kafka
> Issue Type: Improvement
> Affects Versions: 0.8.1.1
> Reporter: Kyle Banker
> Priority: Minor
>
> The current random strategy of partition-to-broker distribution, combined with
> a fairly typical use of min.isr and required.acks, results in a suboptimal
> level of availability.
> Specifically, if all of your topics have a replication factor of 3, and you
> use min.isr=2 and required.acks=all, then regardless of the number of brokers
> in the cluster, you can safely lose only 1 node. Losing more than 1 node will,
> 95% of the time, result in the inability to write to at least one partition,
> thus rendering the cluster unavailable. As the total number of partitions
> increases, so does this probability.
> On the other hand, if partitions are distributed so that brokers are
> effectively replicas of each other, then the probability of unavailability
> when two nodes are lost is significantly decreased. This probability
> continues to decrease as the size of the cluster increases and, more
> significantly, this probability is constant with respect to the total number
> of partitions. The only requirement for getting the numbers below with this
> strategy is that the number of brokers be a multiple of the replication
> factor.
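> (To see why this probability is constant in the partition count: under grouped
> placement, two failed brokers can share a partition only if they fall in the
> same replica group. A back-of-envelope check, added here for illustration and
> not part of the original ticket:)
>   // With b brokers split into disjoint groups of size r, the chance that
>   // two randomly failed brokers land in the same group is (r - 1) / (b - 1),
>   // independent of how many partitions exist.
>   def sameGroupProbability(b: Int, r: Int): Double = (r - 1).toDouble / (b - 1)
>   // sameGroupProbability(6, 3) = 0.4; (9, 3) = 0.25; (12, 3) ~= 0.18,
>   // in the same ballpark as the simulated figures below.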
> Here are the results of some simulations I've run:
> With Random Partition Assignment
> (Brokers / Partitions / Replication Factor / Probability that two randomly
> selected nodes host at least one partition in common)
> 6 / 54 / 3 / .999
> 9 / 54 / 3 / .986
> 12 / 54 / 3 / .894
> With Broker-Replica-Style Partitioning
> (same columns)
> 6 / 54 / 3 / .424
> 9 / 54 / 3 / .228
> 12 / 54 / 3 / .168
> Adopting this strategy will greatly increase availability for users wanting
> majority-style durability, and it should not change current behavior as long
> as partition leaders are assigned evenly. I don't know of any negative impact
> for other use cases, since in those cases the distribution will still be
> effectively random.
> Let me know if you'd like to see simulation code and whether a patch would be
> welcome.
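> In the meantime, here is a minimal sketch of what such a simulation might look
> like (illustrative only; the helper names and structure are assumptions, not
> the actual simulation code):
>   import scala.util.Random
>   // Random placement: each partition's replicas land on rf distinct brokers.
>   def randomAssignment(brokers: Int, partitions: Int, rf: Int): Seq[Set[Int]] =
>     Seq.fill(partitions)(Random.shuffle((0 until brokers).toList).take(rf).toSet)
>   // Grouped placement: partition i lands on replica group i % (brokers / rf).
>   def groupedAssignment(brokers: Int, partitions: Int, rf: Int): Seq[Set[Int]] = {
>     val groups = (0 until brokers).grouped(rf).map(_.toSet).toVector
>     (0 until partitions).map(i => groups(i % groups.size))
>   }
>   // Estimate P(two randomly failed brokers co-host at least one partition);
>   // assign is by-name so a fresh random assignment is drawn per trial.
>   def failureProbability(assign: => Seq[Set[Int]], brokers: Int, trials: Int): Double =
>     (1 to trials).count { _ =>
>       val lost = Random.shuffle((0 until brokers).toList).take(2).toSet
>       assign.exists(replicas => (replicas & lost).size >= 2)
>     }.toDouble / trials
>   // e.g. failureProbability(randomAssignment(9, 54, 3), brokers = 9, trials = 100000)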
> EDIT: Just to clarify, here's how the current partition assigner would assign
> 9 partitions with 3 replicas each to a 9-node cluster (partition number ->
> list of replica brokers).
> 0 = Some(List(2, 3, 4))
> 1 = Some(List(3, 4, 5))
> 2 = Some(List(4, 5, 6))
> 3 = Some(List(5, 6, 7))
> 4 = Some(List(6, 7, 8))
> 5 = Some(List(7, 8, 9))
> 6 = Some(List(8, 9, 1))
> 7 = Some(List(9, 1, 2))
> 8 = Some(List(1, 2, 3))
> Here's how I'm proposing they be assigned:
> 0 = Some(ArrayBuffer(8, 5, 2))
> 1 = Some(ArrayBuffer(8, 5, 2))
> 2 = Some(ArrayBuffer(8, 5, 2))
> 3 = Some(ArrayBuffer(7, 4, 1))
> 4 = Some(ArrayBuffer(7, 4, 1))
> 5 = Some(ArrayBuffer(7, 4, 1))
> 6 = Some(ArrayBuffer(6, 3, 0))
> 7 = Some(ArrayBuffer(6, 3, 0))
> 8 = Some(ArrayBuffer(6, 3, 0))
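> A minimal sketch of an assigner that reproduces the grouped layout above
> (a reconstruction for illustration, not a patch; the helper name and the
> chunking order are assumptions read off the example output):
>   // Sketch: group brokers by id % numGroups, then give each group a
>   // contiguous chunk of partitions, highest group first, to match the
>   // example above. Assumes brokers % rf == 0 and partitions % numGroups == 0.
>   def proposedAssignment(brokers: Int, partitions: Int, rf: Int): Map[Int, Seq[Int]] = {
>     val numGroups = brokers / rf
>     val groups = (0 until brokers).groupBy(_ % numGroups)  // 0 -> (0,3,6), ...
>     val chunk = partitions / numGroups
>     (0 until partitions).map { p =>
>       val g = numGroups - 1 - (p / chunk)
>       p -> groups(g).sorted.reverse
>     }.toMap
>   }
>   // proposedAssignment(9, 9, 3) gives 0 -> Seq(8, 5, 2), ..., 8 -> Seq(6, 3, 0)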