[ 
https://issues.apache.org/jira/browse/KAFKA-188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13417407#comment-13417407
 ] 

Jonathan Creasy edited comment on KAFKA-188 at 7/18/12 7:48 PM:
----------------------------------------------------------------

I started the implementation, and my code looks much like what you have 
described. I am now at the point of deciding which data location to use. I am 
planning on doing a round-robin assignment for each partition. 

So, with 4 data dirs and the following topic/partition scheme:

topic1 - 2 partitions
topic2 - 4 partitions
topic3 - 1 partition
topic4 - 2 partitions

disk1 = topic1/1, topic2/3, topic4/2
disk2 = topic1/2, topic2/4
disk3 = topic2/1, topic3/1
disk4 = topic2/2, topic4/1
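
The round-robin assignment described above can be sketched as follows. This is 
only an illustration in Python, not the actual patch; the function name 
assign_round_robin and the disk names are assumptions made for the example.

```python
from itertools import cycle


def assign_round_robin(topics, data_dirs):
    """Assign every topic/partition to a data dir in round-robin order.

    topics    -- list of (topic_name, partition_count) pairs, in creation order
    data_dirs -- list of data directory names (hypothetical names here)
    """
    assignment = {d: [] for d in data_dirs}
    dirs = cycle(data_dirs)  # endlessly repeats disk1, disk2, ..., disk1, ...
    for topic, partition_count in topics:
        # Partitions are numbered from 1 to match the example in the comment.
        for partition in range(1, partition_count + 1):
            assignment[next(dirs)].append(f"{topic}/{partition}")
    return assignment


topics = [("topic1", 2), ("topic2", 4), ("topic3", 1), ("topic4", 2)]
layout = assign_round_robin(topics, ["disk1", "disk2", "disk3", "disk4"])
for disk, partitions in layout.items():
    print(disk, "=", ", ".join(partitions))
```

With the four topics listed above, this reproduces the layout shown: disk1 
ends up with three partitions and the other disks with two each.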

This is a good first step. Later we may want to add re-balancing code based on 
metrics so that the "produced/consumed messages per second" are roughly 
balanced per disk. That may or may not be feasible or valuable, and it isn't 
really important for this initial implementation.
                
> Support multiple data directories
> ---------------------------------
>
>                 Key: KAFKA-188
>                 URL: https://issues.apache.org/jira/browse/KAFKA-188
>             Project: Kafka
>          Issue Type: New Feature
>            Reporter: Jay Kreps
>
> Currently we allow only a single data directory. This means that a multi-disk 
> configuration needs to be a RAID array or LVM volume or something like that 
> to be mounted as a single directory.
> For a high-throughput, low-reliability configuration this would mean RAID0 
> striping. Common wisdom in Hadoop land has it that a JBOD setup that just 
> mounts each disk as a separate directory and does application-level balancing 
> over them results in about a 30% write improvement. For example, see this 
> claim:
>   http://old.nabble.com/Re%3A-RAID-vs.-JBOD-p21466110.html
> It is not clear to me why this would be the case--it seems the RAID 
> controller should be able to balance writes as well as the application so it 
> may depend on the details of the setup.
> Nonetheless, this would be really easy to implement: all you need to do is 
> add multiple data directories and balance partition creation over these 
> disks.
> One problem this might cause is that if a particular topic is much larger 
> than the others, it might unbalance the load across the disks. The 
> partition->disk assignment policy should probably attempt to spread each 
> topic evenly to avoid this, rather than just trying to keep the number of 
> partitions balanced between disks.
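
One possible reading of the per-topic policy suggested in the description, 
sketched in Python: place each new partition on the directory that currently 
holds the fewest partitions of the same topic, and break ties by the total 
number of partitions on the directory. The names pick_data_dir and 
assign_spread are hypothetical, and this is a sketch of one such policy, not 
what Kafka implements.

```python
from collections import defaultdict


def pick_data_dir(topic, data_dirs, per_topic_counts, total_counts):
    """Return the dir with the fewest partitions of `topic`.

    Ties are broken by the total partition count on the dir, then by the
    order of `data_dirs` (min() keeps the first of equal candidates).
    """
    return min(
        data_dirs,
        key=lambda d: (per_topic_counts[(topic, d)], total_counts[d]),
    )


def assign_spread(topics, data_dirs):
    """Assign partitions so that each topic is spread evenly over the dirs."""
    per_topic_counts = defaultdict(int)  # (topic, dir) -> partitions placed
    total_counts = defaultdict(int)      # dir -> partitions placed
    assignment = {d: [] for d in data_dirs}
    for topic, partition_count in topics:
        for partition in range(1, partition_count + 1):
            d = pick_data_dir(topic, data_dirs, per_topic_counts, total_counts)
            assignment[d].append(f"{topic}/{partition}")
            per_topic_counts[(topic, d)] += 1
            total_counts[d] += 1
    return assignment


spread = assign_spread([("big", 4), ("small", 2)], ["disk1", "disk2"])
```

Partition count is only a proxy for load here; a topic's actual size or 
throughput is not taken into account.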

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
