[ 
https://issues.apache.org/jira/browse/KAFKA-188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13417322#comment-13417322
 ] 

Jay Kreps commented on KAFKA-188:
---------------------------------

Here are a few additional thoughts on this:
1. This is actually a lot more valuable after Kafka 0.8 is out, since we will 
already allow replication at a higher level, making RAID less desirable. The 
patch should definitely go on the 0.8 branch, though it will likely be the same 
for 0.7.
2. It is worth deciding whether we want to support unbalanced disk sizes and 
speeds. E.g. if you have a 7.2k RPM drive and a 10k RPM drive, will we allow 
you to balance over these? I recommend we skip this for now; we can always add 
it later. So, as with RAID, we will treat all drives equally.
3. I think the only change will be in LogManager. Instead of a single 
config.logDir parameter we will need logDirs, an array of directories. In 
createLog() we will need a policy that chooses the best disk on which to place 
the new log-partition.
4. There may be a few other places that assume a single log directory; we may 
have to grep around and check for that. I don't think there is much else, as 
everything else should interact through LogManager, and once the Log instance 
is created it doesn't care where its home directory is.
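A rough sketch of the LogManager change described in point 3 (class and method names here are illustrative assumptions, not the actual patch):

```java
import java.io.File;
import java.util.function.Function;

// Hypothetical sketch: config.logDir becomes an array of directories, and
// createLog() consults a pluggable placement policy to pick the directory
// for a new log-partition. Names are illustrative only.
class LogManagerSketch {
    private final File[] logDirs;
    private final Function<File[], File> placementPolicy;

    LogManagerSketch(File[] logDirs, Function<File[], File> placementPolicy) {
        this.logDirs = logDirs;
        this.placementPolicy = placementPolicy;
    }

    // Choose a home directory for the new log-partition, e.g. /data1/mytopic-0
    File createLog(String topic, int partition) {
        File dir = placementPolicy.apply(logDirs);
        return new File(dir, topic + "-" + partition);
    }
}
```

The placement policy is left as a function parameter so either strategy below (least-loaded or round-robin) could plug in.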

One approach to placement would be to always create new logs on the "least 
loaded" directory. The definition of "least loaded" is likely to be a 
heuristic. There are two things we want to balance: (1) data size, and (2) i/o 
throughput. If the retention policy is based on time, then size is a good proxy 
for throughput. However, you could imagine having one log with a very small 
retention size but very high throughput. Another problem is that the usage may 
change over time, and migration is not feasible. For example, a new feature 
going through a ramped rollout might produce almost no data at first and then 
later produce gobs of data. Furthermore, you might get odd results in the case 
where you manually pre-create many topics all at once, as they would all end up 
on whichever directory had the least data.
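The least-loaded policy could be sketched like this, using total bytes per directory as the (imperfect) proxy discussed above; the names and the recursive size walk are assumptions for illustration:

```java
import java.io.File;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToLongFunction;

// Sketch of a "least loaded" policy: pick the directory currently holding
// the fewest bytes. Size is only a proxy for i/o load, with the caveats
// discussed above (low-retention/high-throughput logs, usage drifting over
// time, batch pre-creation of topics).
class LeastLoadedSketch {
    // Generic over the usage measure so the policy itself is easy to test.
    static <T> T leastLoaded(List<T> dirs, ToLongFunction<T> usage) {
        return dirs.stream().min(Comparator.comparingLong(usage)).get();
    }

    // Crude proxy for load: total bytes under the directory.
    static long diskUsage(File dir) {
        File[] children = dir.listFiles();
        if (children == null) return 0;
        long total = 0;
        for (File f : children)
            total += f.isDirectory() ? diskUsage(f) : f.length();
        return total;
    }

    static File chooseDirectory(List<File> logDirs) {
        return leastLoaded(logDirs, LeastLoadedSketch::diskUsage);
    }
}
```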

I think a better strategy would be to not try to estimate the least-loaded 
directory, and instead just do round-robin assignment (e.g. 
logDirs(counter.getAndIncrement() % logDirs.length)). The assumption would be 
that the number of partitions is large enough that each topic has one 
partition on each disk.
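The round-robin expression above might expand to something like this (a sketch; the real patch would live inside LogManager):

```java
import java.io.File;
import java.util.concurrent.atomic.AtomicInteger;

// Round-robin placement: ignore load entirely and just cycle through the
// data directories, relying on partition counts being large enough that
// each topic lands roughly evenly across disks.
class RoundRobinPlacement {
    private final File[] logDirs;
    private final AtomicInteger counter = new AtomicInteger(0);

    RoundRobinPlacement(File[] logDirs) {
        this.logDirs = logDirs;
    }

    File nextDir() {
        // Math.abs guards against the counter wrapping negative on overflow,
        // a detail the inline expression above glosses over.
        int i = Math.abs(counter.getAndIncrement() % logDirs.length);
        return logDirs[i];
    }
}
```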

Both strategies share a corner case: if all data goes to a single topic (or 
one topic dominates the load distribution), and that topic has (say) 5 local 
partitions spread over 4 data directories, then one directory will get 2x the 
load of the others. However, this corner case can be worked around by 
carefully aligning the partition count with the number of data directories, so 
I don't think we need to handle it here.
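The 2x imbalance here is just modular arithmetic, easy to verify with a quick count (the 5-partitions/4-directories numbers are the hypothetical ones from above):

```java
// With 5 partitions of one dominant topic placed round-robin over 4
// directories, directory 0 receives 2 partitions and the rest receive 1,
// so it sees roughly twice the load of the others.
class CornerCaseSketch {
    static int[] assignmentCounts(int numPartitions, int numDirs) {
        int[] counts = new int[numDirs];
        for (int p = 0; p < numPartitions; p++)
            counts[p % numDirs]++;
        return counts;
    }
}
```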
                
> Support multiple data directories
> ---------------------------------
>
>                 Key: KAFKA-188
>                 URL: https://issues.apache.org/jira/browse/KAFKA-188
>             Project: Kafka
>          Issue Type: New Feature
>            Reporter: Jay Kreps
>
> Currently we allow only a single data directory. This means that a multi-disk 
> configuration needs to be a RAID array or LVM volume or something like that 
> to be mounted as a single directory.
> For a high-throughput low-reliability configuration this would mean RAID0 
> striping. Common wisdom in Hadoop land has it that a JBOD setup that just 
> mounts each disk as a separate directory and does application-level balancing 
> over these results in about 30% write-improvement. For example see this claim 
> here:
>   http://old.nabble.com/Re%3A-RAID-vs.-JBOD-p21466110.html
> It is not clear to me why this would be the case--it seems the RAID 
> controller should be able to balance writes as well as the application so it 
> may depend on the details of the setup.
> Nonetheless this would be really easy to implement, all you need to do is add 
> multiple data directories and balance partition creation over these disks.
> One problem this might cause is if a particular topic is much larger than the 
> others it might unbalance the load across the disks. The partition->disk 
> assignment policy should probably attempt to evenly spread each topic to 
> avoid this, rather than just trying to keep the number of partitions balanced 
> between disks.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
