[ 
https://issues.apache.org/jira/browse/KAFKA-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15065455#comment-15065455
 ] 

Jay Kreps commented on KAFKA-3015:
----------------------------------

[~guozhang] Yeah the original proposal and patch actually always placed new 
partitions on the disk with the most free space. Unfortunately it doesn't 
really work. Consider a case where you have 3 partitions and 50% utilization on 
disk A and 3 partitions and 53% on disk B, and you are creating a new topic 
which will have 10000 partitions. All of these new partitions would be created 
on disk A, because they are empty at creation time and so never change the 
free-space ranking. This would likely lead to a lot of imbalance once those 
10000 partitions started getting traffic.
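
To make the failure mode concrete, here is a minimal sketch of the "most free space wins" rule. It is not the code from the original patch; the function name and the dictionary of utilizations are made up for illustration, and it assumes a new partition holds no data when it is created, so placing it does not change any disk's utilization.

```python
def place_on_most_free(utilization, new_partitions):
    """Assign each new partition to the disk with the lowest utilization.

    `utilization` maps a disk name to the fraction of the disk in use.
    Assumption: a freshly created partition is empty, so placing it
    leaves the utilization figures unchanged.
    """
    counts = {disk: 0 for disk in utilization}
    for _ in range(new_partitions):
        # The ranking never changes, so the same disk wins every time.
        target = min(utilization, key=utilization.get)
        counts[target] += 1
    return counts


counts = place_on_most_free({"A": 0.50, "B": 0.53}, 10000)
print(counts)  # {'A': 10000, 'B': 0}
```

A 3-point difference in utilization is enough to send every new partition to disk A.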

[~toddpalino] Yeah whether this is useful definitely depends on how you handle 
machine failures. In AWS or in an environment where you have reserve hardware 
you might well just immediately swap in a new machine and repair or discard the 
old asynchronously. For those environments I do think good JBOD support is a 
big deal as it effectively doubles disk write throughput which is the 
bottleneck for many uses. If that's not the setup then this feature alone won't 
help you, but it's not worse than the current state. 

However maybe what you're arguing is that the current partitioning scheme could 
be combined with hypothetical in-place disk failures where, when a disk fails, 
you'd disable the partitions on that disk but keep serving the other 
partitions. The current scheme would combine well with that since each 
partition is entirely on one disk. However the proposal in this JIRA would not 
combine well since any disk failure would disable all partitions.
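
A small sketch of that difference, under the assumption that a partition becomes unavailable as soon as any disk holding part of it fails. The partition names and the `partitions_lost` helper are hypothetical and only illustrate the argument.

```python
def partitions_lost(placement, failed_disk):
    """Return the partitions that have data on the failed disk.

    `placement` maps a partition name to the set of disks holding its data.
    """
    return {p for p, disks in placement.items() if failed_disk in disks}


# Current scheme: each partition lives entirely on one disk.
current = {"t-0": {"A"}, "t-1": {"B"}, "t-2": {"A"}, "t-3": {"B"}}
# Proposed scheme: every partition has segments on every disk.
striped = {p: {"A", "B"} for p in current}

lost_current = sorted(partitions_lost(current, "A"))
lost_striped = sorted(partitions_lost(striped, "A"))
print(lost_current)  # ['t-0', 't-2']
print(lost_striped)  # ['t-0', 't-1', 't-2', 't-3']
```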

> Improve JBOD data balancing
> ---------------------------
>
>                 Key: KAFKA-3015
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3015
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Jay Kreps
>
> When running with multiple data directories (i.e. JBOD) we currently place 
> partitions entirely within one data directory. This tends to lead to poor 
> balancing across disks as some topics have more throughput/retention and not 
> all disks get data from all topics. You can't fix this problem with smarter 
> partition placement strategies because ultimately you don't know, at the 
> time a partition is created, when or how heavily it will be used. (This is a 
> subtle point, and the tendency is to try to think of some more sophisticated 
> way to place partitions based on current data size, but this is actually 
> exceptionally dangerous and can lead to much worse imbalance when creating 
> many partitions at once, as they would all go to the disk with the least 
> data.) We don't support online rebalancing across directories/disks so this 
> imbalance is a big problem and limits the usefulness of this configuration. 
> Implementing online rebalancing of data across disks without downtime is 
> actually quite hard and requires lots of I/O since you have to actually 
> rewrite full partitions of data.
> An alternative would be to place each partition in *all* directories/drives 
> and round-robin *segments* within the partition across the directories. So 
> the layout would be something like:
>   drive-a/mytopic-0/
>       0000000.data
>       0000000.index
>       0024680.data
>       0024680.index
>   drive-b/mytopic-0/
>       0012345.data
>       0012345.index
>       0036912.data
>       0036912.index
> This is a little harder to implement than the current approach but not very 
> hard, and it is a lot easier than implementing online data balancing across 
> disks while retaining the current approach. I think this could easily be done 
> in a backwards compatible way.
> I think the balancing you would get from this in most cases would be good 
> enough to make JBOD the default configuration. Thoughts?
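
The round-robin segment placement described in the issue could be sketched roughly as follows. This is an illustration only, not Kafka code: the function names are invented, the second directory is assumed to be a distinct drive (called drive-b here), and segments are assumed to be assigned in order of their base offset.

```python
def segment_directory(segment_index, data_dirs):
    """Pick a data directory for the n-th segment of a partition, round-robin."""
    return data_dirs[segment_index % len(data_dirs)]


def layout(topic_partition, base_offsets, data_dirs):
    """Map each data directory to the segment files it would hold."""
    placement = {d: [] for d in data_dirs}
    for i, base in enumerate(sorted(base_offsets)):
        d = segment_directory(i, data_dirs)
        # File names follow the zero-padded pattern used in the example layout.
        placement[d].append(f"{d}/{topic_partition}/{base:07d}.data")
    return placement


result = layout("mytopic-0", [0, 12345, 24680, 36912], ["drive-a", "drive-b"])
for directory, files in result.items():
    print(directory, files)
```

With two directories, consecutive segments alternate between drives, so writes to a single hot partition are spread over both disks rather than landing on one.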



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
