[ https://issues.apache.org/jira/browse/KAFKA-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15065636#comment-15065636 ]
Todd Palino commented on KAFKA-3015:
------------------------------------

[~jkreps] So yes, I'm essentially saying that I would prefer to see optimizations to the current partitioning scheme, plus the ability to handle single-disk failures without terminating the entire broker. I would argue that this has a higher payoff for people who cannot easily swap in new machines (as we can, or as those in AWS can), because it allows for more granular failures. Disks fail more often than any other component, so the ability to survive a disk failure should be attractive to anyone.

As noted, there are a lot of benefits to using JBOD, and I would not argue against having good support for it. Especially as we've been moving to new hardware, we can't take advantage of all the network bandwidth of our 10-gig interfaces, primarily due to limitations on disk capacity and throughput. Additionally, we'd like to move to RF=3, but can't stomach it because of the cost. If we drop the RAID 10 we will significantly increase throughput and double our storage capacity. Then we can easily move to RF=3 (using 50% more disk) with little impact.

> Improve JBOD data balancing
> ---------------------------
>
>                 Key: KAFKA-3015
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3015
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Jay Kreps
>
> When running with multiple data directories (i.e. JBOD) we currently place
> partitions entirely within one data directory. This tends to lead to poor
> balancing across disks as some topics have more throughput/retention and not
> all disks get data from all topics.
> You can't fix this problem with smarter partition placement strategies
> because ultimately you don't know, when a partition is created, when or how
> heavily it will be used (this is a subtle point, and the tendency is to try
> to think of some more sophisticated way to place partitions based on current
> data size, but this is actually exceptionally dangerous and can lead to much
> worse imbalance when creating many partitions at once, as they would all go
> to the disk with the least data). We don't support online rebalancing across
> directories/disks, so this imbalance is a big problem and limits the
> usefulness of this configuration. Implementing online rebalancing of data
> across disks without downtime is actually quite hard and requires lots of
> I/O, since you have to actually rewrite full partitions of data.
>
> An alternative would be to place each partition in *all* directories/drives
> and round-robin *segments* within the partition across the directories. So
> the layout would be something like:
>
> drive-a/mytopic-0/
>     0000000.data
>     0000000.index
>     0024680.data
>     0024680.index
> drive-b/mytopic-0/
>     0012345.data
>     0012345.index
>     0036912.data
>     0036912.index
>
> This is a little harder to implement than the current approach but not very
> hard, and it is a lot easier than implementing online data balancing across
> disks while retaining the current approach. I think this could easily be
> done in a backwards-compatible way.
>
> I think the balancing you would get from this in most cases would be good
> enough to make JBOD the default configuration. Thoughts?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
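The segment round-robin idea in the quoted description can be sketched in a few lines. This is a minimal illustration of the placement rule only, not Kafka's actual code; the class and method names (`SegmentPlacer`, `nextSegmentDir`) are hypothetical:

```java
import java.nio.file.Path;
import java.util.List;

// Sketch of the proposed scheme: every partition has a directory on every
// data drive, and each newly rolled segment is assigned to the next drive
// in round-robin order. All names here are illustrative, not Kafka's API.
public class SegmentPlacer {
    private final List<Path> dataDirs; // e.g. [drive-a, drive-b, ...]
    private int next = 0;              // index of the drive for the next segment

    public SegmentPlacer(List<Path> dataDirs) {
        this.dataDirs = dataDirs;
    }

    /** Returns the directory that should hold the partition's next log segment. */
    public Path nextSegmentDir(String topicPartition) {
        Path dir = dataDirs.get(next).resolve(topicPartition);
        next = (next + 1) % dataDirs.size();
        return dir;
    }
}
```

With two drives, consecutive segments of `mytopic-0` alternate between `drive-a/mytopic-0/` and `drive-b/mytopic-0/`, approximating the layout shown above; over many segments each drive receives a near-equal share of every partition's data, which is the balancing property the proposal relies on.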
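The capacity trade-off Todd describes in his comment (drop RAID 10, double usable capacity, then spend 50% more on RF=3) can be sanity-checked with quick arithmetic. The 12-disk, 4 TB-per-disk broker below is an illustrative assumption, not a figure from the ticket:

```java
// Quick check of the RAID-10-vs-JBOD capacity arithmetic from the comment.
// The 12x4TB broker is an assumed example configuration, not from the ticket.
public class CapacityCheck {
    // Usable TB on one broker under RAID 10: every byte is mirrored,
    // so half the raw capacity is usable.
    static double raid10Usable(int disks, double tbPerDisk) {
        return disks * tbPerDisk / 2;
    }

    // Usable TB on one broker under JBOD: all disks exposed directly.
    static double jbodUsable(int disks, double tbPerDisk) {
        return disks * tbPerDisk;
    }

    // Raw bytes written cluster-wide per logical byte stored:
    // (copies per broker due to mirroring) x (replication factor).
    static double rawBytesPerLogicalByte(double copiesPerBroker, int rf) {
        return copiesPerBroker * rf;
    }

    public static void main(String[] args) {
        System.out.println(raid10Usable(12, 4.0));        // 24.0 TB usable
        System.out.println(jbodUsable(12, 4.0));          // 48.0 TB -- doubled
        System.out.println(rawBytesPerLogicalByte(2, 2)); // RAID10 + RF=2: 4.0
        System.out.println(rawBytesPerLogicalByte(1, 3)); // JBOD   + RF=3: 3.0
    }
}
```

So moving from RAID 10 with RF=2 (4 raw bytes per logical byte) to JBOD with RF=3 (3 raw bytes per logical byte) actually *reduces* total raw storage per stored byte while gaining an extra replica, which is consistent with the "little impact" claim in the comment.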