[ https://issues.apache.org/jira/browse/CASSANDRA-7203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228729#comment-14228729 ]
Benedict commented on CASSANDRA-7203:
-------------------------------------

[~jbellis]: Are we sure that's a good policy? It's generally accepted that a lot of work (esp. that involving people, e.g. Netflix, Apple) follows a zipfian/extreme distribution. If we can prevent the most voluminous customers from degrading performance for everybody, that's surely a pretty big win?

I'm not suggesting this be attacked immediately, but in the medium-to-long term it seems like a pretty decent yield - and it could be applied on both read and write. If you have 1% of your data appearing in ~100% of your sstables, but the other 99% appearing in only ~1% of your sstables, you're compacting an order of magnitude more often than you might otherwise need to.

Perhaps [~jasobrown] and [~kohlisankalp] have an idea of how realistic this scenario is?

> Flush (and Compact) High Traffic Partitions Separately
> ------------------------------------------------------
>
>                 Key: CASSANDRA-7203
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7203
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benedict
>              Labels: compaction, performance
>
> An idea possibly worth exploring is the use of streaming count-min sketches
> to collect data over the up-time of a server to estimate the velocity of
> different partitions, so that high-velocity partitions can be flushed
> separately on the assumption that they will be much smaller in number, thus
> reducing write amplification by permitting compaction independently of any
> low-velocity data.
>
> Whilst the idea is reasonably straightforward, it seems that the biggest
> problem here will be defining a success metric. Obviously any workload
> following an exponential/zipf/extreme distribution is likely to benefit from
> such an approach, but whether or not that would translate into real-world
> gains is another matter.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
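For reference, the count-min-sketch idea in the description could look roughly like the Python sketch below: estimate per-partition write counts in constant memory, then classify a partition as "hot" at flush time if its estimated share of all writes crosses a threshold. This is an illustrative sketch only - the `CountMinSketch` class, the `is_hot` helper, and the 1% cutoff are assumptions for exposition, not anything in Cassandra's codebase.

```python
import hashlib

class CountMinSketch:
    """Fixed-memory frequency sketch; estimates over-count, never under-count."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]
        self.total = 0  # total updates seen, for computing a partition's share

    def _buckets(self, key):
        # Derive `depth` bucket indices from one digest (4 bytes per row).
        digest = hashlib.sha256(key.encode()).digest()
        for row in range(self.depth):
            chunk = digest[row * 4:(row + 1) * 4]
            yield row, int.from_bytes(chunk, "big") % self.width

    def add(self, key, count=1):
        self.total += count
        for row, col in self._buckets(key):
            self.table[row][col] += count

    def estimate(self, key):
        # Minimum over rows bounds the over-estimate from hash collisions.
        return min(self.table[row][col] for row, col in self._buckets(key))

def is_hot(sketch, partition_key, hot_fraction=0.01):
    """Hypothetical flush-time check: does this partition account for more
    than `hot_fraction` of all writes observed so far?"""
    return sketch.estimate(partition_key) >= hot_fraction * sketch.total
```

Under a zipfian workload, the handful of partitions for which `is_hot` returns true would be flushed to their own sstables and compacted independently of the low-velocity data; the sketch itself stays small (width x depth counters) regardless of how many distinct partitions are seen.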