[ https://issues.apache.org/jira/browse/CASSANDRA-7203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228729#comment-14228729 ]
Benedict commented on CASSANDRA-7203:
-------------------------------------

[~jbellis]: Are we sure that's a good policy? It's generally accepted that a lot of work (esp. that involving people, e.g. Netflix, Apple) follows a zipfian/extreme distribution. If we can prevent the most voluminous customers from degrading performance for everybody, that's surely a pretty big win?

I'm not suggesting this be attacked immediately, but in the medium-to-long term it seems like a pretty decent yield - and it could be applied on both read and write. If you have 1% of your data appearing in ~100% of your sstables, but the other 99% appearing in only ~1% of your sstables, you're compacting an order of magnitude more often than you might otherwise need to.

Perhaps [~jasobrown] and [~kohlisankalp] have an idea of how realistic this scenario is?

> Flush (and Compact) High Traffic Partitions Separately
> ------------------------------------------------------
>
>                 Key: CASSANDRA-7203
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7203
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benedict
>              Labels: compaction, performance
>
> An idea possibly worth exploring is the use of streaming count-min sketches
> to collect data over the up-time of a server to estimate the velocity of
> different partitions, so that high-velocity partitions can be flushed
> separately on the assumption that they will be much smaller in number, thus
> reducing write amplification by permitting compaction independently of any
> low-velocity data.
>
> Whilst the idea is reasonably straightforward, it seems that the biggest
> problem here will be defining a success metric. Obviously any workload
> following an exponential/zipf/extreme distribution is likely to benefit from
> such an approach, but whether or not that would translate into real-world
> gains is another matter.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
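For reference, the count-min-sketch idea in the description could look roughly like the Python sketch below: estimate per-partition write counts in constant memory, then classify a partition as "hot" at flush time if its estimated share of all writes crosses a threshold. This is an illustrative sketch only - the `CountMinSketch` class, the `is_hot` helper, and the 1% cutoff are assumptions for exposition, not anything in Cassandra's codebase.

```python
import hashlib

class CountMinSketch:
    """Fixed-memory frequency sketch; estimates over-count, never under-count."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]
        self.total = 0  # total updates seen, for computing a partition's share

    def _buckets(self, key):
        # Derive `depth` bucket indices from one digest (4 bytes per row).
        digest = hashlib.sha256(key.encode()).digest()
        for row in range(self.depth):
            chunk = digest[row * 4:(row + 1) * 4]
            yield row, int.from_bytes(chunk, "big") % self.width

    def add(self, key, count=1):
        self.total += count
        for row, col in self._buckets(key):
            self.table[row][col] += count

    def estimate(self, key):
        # Minimum over rows bounds the over-estimate from hash collisions.
        return min(self.table[row][col] for row, col in self._buckets(key))

def is_hot(sketch, partition_key, hot_fraction=0.01):
    """Hypothetical flush-time check: does this partition account for more
    than `hot_fraction` of all writes observed so far?"""
    return sketch.estimate(partition_key) >= hot_fraction * sketch.total
```

Under a zipfian workload, the handful of partitions for which `is_hot` returns true would be flushed to their own sstables and compacted independently of the low-velocity data; the sketch itself stays small (width x depth counters) regardless of how many distinct partitions are seen.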