[ https://issues.apache.org/jira/browse/CASSANDRA-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14208344#comment-14208344 ]

Jonathan Ellis commented on CASSANDRA-7386:
-------------------------------------------

We have three related problems.  First, the existing design is bad at balancing 
space used (as posted initially, I'm okay with this to a degree).  Second, and 
more serious, post-CASSANDRA-5605 we can actually run out of space because of 
this.  Third, space used is actually the wrong metric to optimize for 
balancing.

Starting with the last: ultimately, we want to optimize for balanced reads 
across the disks with enough space.  We shouldn't include writes in the metrics 
because writes are transient.  But trying to balance based on target disk 
readMeter is probably no more useful than disk space; we would need to take 
hotness of the source sstables into consideration as well, and compact cold 
sstables to disks with high activity and hot ones to disks with low activity.  
This is outside the scope of this ticket.

So, if balancing by disk space is the best we can do, here is an optimal 
approach: 

# Compute the total free disk space T as the sum of each disk's free space D
# For each disk, assign it a D/T share of the new sstables.  (Weighted random 
may be easiest.)
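The two steps above can be sketched as a weighted random pick, where a disk is chosen with probability proportional to its free space (D/T).  This is a minimal illustration only; the function name and the plain list-of-bytes representation are hypothetical, not Cassandra's actual Directories API.

```python
import random

def pick_disk(free_space):
    """Pick a disk index with probability proportional to its free space.

    free_space: list of free bytes per disk (hypothetical representation;
    the real code would work against Cassandra's data directories).
    """
    total = sum(free_space)  # T: total free space across all disks
    if total <= 0:
        raise ValueError("no free space on any disk")
    # random.choices performs weighted random selection: disk i is
    # chosen with probability free_space[i] / total, i.e. D/T.
    return random.choices(range(len(free_space)), weights=free_space, k=1)[0]
```

Over many sstables this converges to each disk receiving its D/T share without needing to track assignment counts.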

To ensure we never accidentally assign an sstable to a disk that doesn't have 
room for it, we should also estimate the space to be used and restrict our 
candidates to disks that have room for it.  Basically, revert CASSANDRA-5605.  
But we don't want to go back to the bad old days of being too pessimistic.  So 
our fallback is: if no disk has space for the worst-case estimate, pick the 
disk with the most free space.
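Combining the weighted pick with the worst-case check and the fallback could look like the sketch below.  Again, the names and the size estimate parameter are hypothetical stand-ins for what the real patch would compute from the compaction's input sstables.

```python
import random

def pick_disk_with_fallback(free_space, worst_case_size):
    """Weighted-random pick among disks that can hold the worst-case
    estimated sstable size; if none can, fall back to the single disk
    with the most free space.  Sketch only; not Cassandra's actual API.
    """
    # Restrict candidates to disks with room for the worst-case estimate.
    candidates = [i for i, free in enumerate(free_space)
                  if free >= worst_case_size]
    if not candidates:
        # Fallback: no disk can hold the worst-case estimate, so pick
        # the disk with the most free space rather than failing.
        return max(range(len(free_space)), key=lambda i: free_space[i])
    # Weighted random among the remaining candidates, proportional to
    # each candidate's free space.
    weights = [free_space[i] for i in candidates]
    return random.choices(candidates, weights=weights, k=1)[0]
```

The fallback keeps the worst-case estimate from being as blocking as it was pre-CASSANDRA-5605: an overly pessimistic estimate degrades to "most free space" rather than refusing to compact.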

> JBOD threshold to prevent unbalanced disk utilization
> -----------------------------------------------------
>
>                 Key: CASSANDRA-7386
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7386
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Chris Lohfink
>            Assignee: Alan Boudreault
>            Priority: Minor
>             Fix For: 2.1.3
>
>         Attachments: 7386-v1.patch, 7386v2.diff, Mappe1.ods, 
> mean-writevalue-7disks.png, patch_2_1_branch_proto.diff, 
> sstable-count-second-run.png
>
>
> Currently the disks are picked first by number of current tasks, then by 
> free space.  This helps with performance but can lead to large differences 
> in utilization in some (unlikely but possible) scenarios.  I've seen 55% to 
> 10% and heard reports of 90% to 10% on IRC.  This happens with both LCS and 
> STCS (although my suspicion is that STCS makes it worse, since it's harder 
> to stay balanced).
> I propose the algorithm change a little to have some maximum range of 
> utilization where it will pick by free space over load (acknowledging it can 
> be slower).  So if disk A is 30% full and disk B is 5% full, it will never 
> pick A over B until they balance out.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)