[ https://issues.apache.org/jira/browse/KAFKA-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188566#comment-14188566 ]

Kyle Banker commented on KAFKA-1736:
------------------------------------

Clearly, it's important to think about how a partitioning scheme affects both 
data loss and availability. In theory, there's an even tradeoff between the 
two. Let's start with data loss.

If I set up a topic with a replication factor of 3, then I'm implicitly 
accepting the fact that 3 catastrophic node failures probably imply a loss of 
data. I agree with your intuition that the mathematically expected total data 
loss is probably the same with both partitioning strategies. Still, it should 
also be noted that for many use cases where durability matters, a partial loss 
of data is just as bad as a total loss, since a partial loss will imply an 
inconsistent data set. Say we have a 9-node cluster with a replication factor 
of 3. With broker-level replication, losing the wrong 3 nodes means losing 1/3 
of the data. With random-partition-level replication, losing any 3 nodes may 
mean losing a much smaller fraction of the data. But again, both cases will be 
equally catastrophic for certain applications.
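
To make that tradeoff concrete, here is a rough sketch of the comparison (my 
own illustration, not Kafka code; it assumes purely random replica placement 
and a hypothetical 54-partition topic). Broker-level replication loses data 
only when the 3 failed nodes line up with one replica group, but then it loses 
1/3 of the partitions; random placement loses something after far more of the 
possible 3-node failures, but only a small slice each time, and the expected 
loss comes out the same.

import scala.util.Random

object DataLossSketch {
  val brokers = (0 until 9).toList
  val numPartitions = 54
  val rf = 3

  // Broker-level ("clumped") placement: fixed groups of 3 brokers.
  val clumped: Seq[Set[Int]] = {
    val groups = brokers.grouped(rf).map(_.toSet).toVector
    (0 until numPartitions).map(p => groups(p % groups.size))
  }

  // Random placement: each partition picks 3 distinct brokers at random.
  val random: Seq[Set[Int]] =
    Seq.fill(numPartitions)(Random.shuffle(brokers).take(rf).toSet)

  // A partition's data is gone only if ALL of its replicas are on failed brokers.
  def lostFraction(assignment: Seq[Set[Int]], failed: Set[Int]): Double =
    assignment.count(_.subsetOf(failed)).toDouble / numPartitions

  def main(args: Array[String]): Unit = {
    // Every possible choice of 3 failed brokers out of 9 (84 combinations).
    val failures = brokers.combinations(rf).map(_.toSet).toSeq
    for ((name, assignment) <- Seq("clumped" -> clumped, "random " -> random)) {
      val losses = failures.map(lostFraction(assignment, _))
      println(f"$name: P(any loss)=${losses.count(_ > 0).toDouble / losses.size}%.3f" +
        f"  worst=${losses.max}%.3f  expected=${losses.sum / losses.size}%.4f")
    }
  }
}

With these numbers both schemes lose about 1/84 of the data in expectation; the 
clumped scheme just concentrates that loss into a few catastrophic 
combinations, which is exactly why an application for which partial loss equals 
total loss shouldn't prefer one over the other on durability grounds alone.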

Overall, I believe one should be less concerned with the data loss case because 
an unrecoverable crash (e.g., a total disk failure) on three separate machines at
the same time is exceedingly rare. Additionally, data loss can always be 
mitigated by increasing the replication factor and using an appropriate backup 
strategy.

This is why I'm more concerned with availability. Clearly, network partitions, 
machine reboots, etc., are much more common than total disk failures. As I 
mentioned originally, the unfortunate and counter-intuitive problem with 
random-partition-level replication is that increasing the number of brokers 
does not increase overall availability. Once the number of partitions passes a 
certain threshold, losing more than one node almost certainly renders at least 
one partition unwritable, and thus the cluster effectively unavailable.
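
To put a rough number on that (a back-of-the-envelope sketch of my own, not 
anything from the Kafka code base, and it assumes purely random replica 
placement, which the current assigner only approximates): with b brokers and a 
replication factor of 3, a given partition puts replicas on both brokers of one 
specific failed pair with probability q = C(b-2, 1) / C(b, 3), so a cluster 
with P partitions survives losing that pair only with probability (1 - q)^P, 
which heads to zero as P grows no matter how large b is.

object UnavailabilityEstimate {
  // n choose k; exact in Long for the small values used here
  def choose(n: Int, k: Int): Long =
    if (k == 0) 1L else choose(n - 1, k - 1) * n / k

  // Probability that a single partition (rf replicas placed uniformly at
  // random on b brokers) has replicas on both brokers of one specific pair.
  def qPerPartition(b: Int, rf: Int): Double =
    choose(b - 2, rf - 2).toDouble / choose(b, rf)

  // Probability that losing that pair of brokers makes at least one of p
  // partitions unwritable under min.isr=2 / required.acks=all.
  def pUnavailable(b: Int, p: Int, rf: Int): Double =
    1.0 - math.pow(1.0 - qPerPartition(b, rf), p)

  def main(args: Array[String]): Unit =
    for (b <- Seq(6, 9, 12))
      println(f"$b brokers, 54 partitions, rf=3: ${pUnavailable(b, 54, 3)}%.3f")
}

For 54 partitions this comes out to roughly 1.000, 0.991, and 0.919 for 6, 9, 
and 12 brokers, which is in the same ballpark as the simulated .999 / .986 / 
.894 quoted below; the gap is presumably just the real assigner not being 
purely random.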

I concede that all of my topics need to have the same replication factor in 
order to get the ideal placement. But imagine a scenario where I have two types 
of topics: those that require the greatest consistency and those that don't. I 
don't see how the latter topics, having, say, a replication factor of 1, would 
be any worse off under this scheme. The first type of topic would get the 
availability benefits of broker-level replication, and the second would be 
effectively randomly distributed.

I like your proposal for a clumping strategy, and I believe we should pursue 
that further. While adding yet another configuration parameter (i.e., C in your 
example) isn't desirable, the greater availability it would provide is worth it.

I'm eager and willing to help with this if other folks like the idea and want 
to refine it.

> Improve partition-broker assignment strategy for better availability in 
> majority durability modes
> -----------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-1736
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1736
>             Project: Kafka
>          Issue Type: Improvement
>    Affects Versions: 0.8.1.1
>            Reporter: Kyle Banker
>            Priority: Minor
>
> The current random strategy of partition-to-broker distribution combined with 
> a fairly typical use of min.isr and request.acks results in a suboptimal 
> level of availability.
> Specifically, if all of your topics have a replication factor of 3, and you 
> use min.isr=2 and required.acks=all, then regardless of the number of 
> brokers in the cluster, you can safely lose only 1 node. Losing more than 1 
> node will, 95% of the time, result in the inability to write to at least one 
> partition, thus rendering the cluster unavailable. As the total number of 
> partitions increases, so does this probability.
> On the other hand, if partitions are distributed so that brokers are 
> effectively replicas of each other, then the probability of unavailability 
> when two nodes are lost is significantly decreased. This probability 
> continues to decrease as the size of the cluster increases and, more 
> significantly, this probability is constant with respect to the total number 
> of partitions. The only requirement for getting these numbers with this 
> strategy is that the number of brokers be a multiple of the replication 
> factor.
> Here are the results of some simulations I've run:
> With Random Partition Assignment
> Number of Brokers / Number of Partitions / Replication Factor / Probability 
> that two randomly selected nodes will contain at least 1 of the same 
> partitions
> 6  / 54 / 3 / .999
> 9  / 54 / 3 / .986
> 12 / 54 / 3 / .894
> Broker-Replica-Style Partitioning
> Number of Brokers / Number of Partitions / Replication Factor / Probability 
> that two randomly selected nodes will contain at least 1 of the same 
> partitions
> 6  / 54 / 3 / .424
> 9  / 54 / 3 / .228
> 12 / 54 / 3 / .168
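> For reference, here is a minimal sketch of the kind of simulation that 
> produces numbers like the ones above (illustrative only, with made-up names; 
> it models purely random and purely clumped placement rather than the actual 
> assigner, so the figures will differ slightly):
> import scala.util.Random
> object SharedPartitionSim {
>   type Assignment = Seq[Set[Int]]   // one replica set per partition
>   // Purely random placement: each partition picks rf distinct brokers.
>   def randomAssignment(brokers: Int, partitions: Int, rf: Int): Assignment =
>     Seq.fill(partitions)(Random.shuffle((0 until brokers).toList).take(rf).toSet)
>   // Purely clumped placement: brokers form fixed groups of size rf
>   // (assumes the broker count is a multiple of rf).
>   def clumpedAssignment(brokers: Int, partitions: Int, rf: Int): Assignment = {
>     val groups = (0 until brokers).grouped(rf).map(_.toSet).toVector
>     (0 until partitions).map(p => groups(p % groups.size))
>   }
>   // Estimates the probability that two randomly chosen brokers both hold a
>   // replica of at least one common partition.
>   def shareProbability(makeAssignment: () => Assignment, brokers: Int,
>                        trials: Int = 20000): Double = {
>     val hits = (1 to trials).count { _ =>
>       val assignment = makeAssignment()
>       val pair = Random.shuffle((0 until brokers).toList).take(2)
>       assignment.exists(replicas => pair.forall(replicas.contains))
>     }
>     hits.toDouble / trials
>   }
>   def main(args: Array[String]): Unit =
>     for (b <- Seq(6, 9, 12)) {
>       val rnd = shareProbability(() => randomAssignment(b, 54, 3), b)
>       val clm = shareProbability(() => clumpedAssignment(b, 54, 3), b)
>       println(f"$b / 54 / 3: random $rnd%.3f, clumped $clm%.3f")
>     }
> }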
> Adopting this strategy will greatly increase availability for users wanting 
> majority-style durability and should not change current behavior as long as 
> leader partitions are assigned evenly. I don't know of any negative impact 
> for other use cases, as in these cases, the distribution will still be 
> effectively random.
> Let me know if you'd like to see simulation code and whether a patch would be 
> welcome.
> EDIT: Just to clarify, here's how the current partition assigner would assign 
> 9 partitions with 3 replicas each to a 9-node cluster (broker number -> set 
> of replicas).
> 0 = Some(List(2, 3, 4))
> 1 = Some(List(3, 4, 5))
> 2 = Some(List(4, 5, 6))
> 3 = Some(List(5, 6, 7))
> 4 = Some(List(6, 7, 8))
> 5 = Some(List(7, 8, 9))
> 6 = Some(List(8, 9, 1))
> 7 = Some(List(9, 1, 2))
> 8 = Some(List(1, 2, 3))
> Here's how I'm proposing they be assigned:
> 0 = Some(ArrayBuffer(8, 5, 2))
> 1 = Some(ArrayBuffer(8, 5, 2))
> 2 = Some(ArrayBuffer(8, 5, 2))
> 3 = Some(ArrayBuffer(7, 4, 1))
> 4 = Some(ArrayBuffer(7, 4, 1))
> 5 = Some(ArrayBuffer(7, 4, 1))
> 6 = Some(ArrayBuffer(6, 3, 0))
> 7 = Some(ArrayBuffer(6, 3, 0))
> 8 = Some(ArrayBuffer(6, 3, 0))
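> For illustration, here is a rough sketch of an assigner that produces this 
> kind of clumped layout (not a proposed patch; the names are made up, the 
> exact grouping and rotation order is a detail, and it assumes the number of 
> brokers is a multiple of the replication factor):
> object ClumpedAssignmentSketch {
>   // Brokers are split into fixed groups of size rf, and partitions are
>   // spread round-robin across those groups, so brokers within a group are
>   // effectively replicas of each other.
>   def assign(brokerIds: Seq[Int], numPartitions: Int, rf: Int): Map[Int, Seq[Int]] = {
>     require(brokerIds.size % rf == 0,
>       "broker count must be a multiple of the replication factor")
>     val groups = brokerIds.grouped(rf).toVector
>     (0 until numPartitions).map { p =>
>       val group = groups(p % groups.size)
>       // Rotate within the group so that the first replica (the preferred
>       // leader) is spread evenly across the group's brokers.
>       val shift = (p / groups.size) % rf
>       p -> (group.drop(shift) ++ group.take(shift))
>     }.toMap
>   }
>   def main(args: Array[String]): Unit =
>     assign(0 until 9, numPartitions = 9, rf = 3).toSeq.sortBy(_._1).foreach {
>       case (p, replicas) => println(s"$p = ${replicas.mkString("(", ", ", ")")}")
>     }
> }
> With a replication factor of 1 the same round-robin simply hands out one 
> broker per partition, so low-durability topics are spread evenly and are no 
> worse off.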



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
