[ 
https://issues.apache.org/jira/browse/CASSANDRA-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935846#action_12935846
 ] 

Peter Schuller commented on CASSANDRA-1658:
-------------------------------------------

At first I really really liked it, but then I realized a problem that takes a 
way a little bit of it and now I'm not sure. Anyways, firstly what I like: 
Couple this with posix_fadvise()/DONTNEED on the sstable's being switched 
*from*, and one would not even have to have memory for both sets of sstables in 
order to remain hot in cases where you rely on a cf being mostly or completely 
in memory.

The posix_fadvise() (and munlock() if mlock():ed sstables come into the picture 
in the future) would presumably be done at some granularity higher than rows or 
calls would be much too frequent for performance purposes. But doing so every 
few tens of MB:s or something should be fine.

In addition, on the topic of rate limiting, fsync():ing would still be required 
for rate limiting purposes under some circumstances to avoid affecting read 
latencies too much (to avoid e.g. the OS pushing out more than fits in 
battery-backed cache on a RAID controller as a result of pushing data in 
bursts).

A big downside though is this: For workloads where performance is dependent on 
the warmness of the cache with respect to the active set, this way of doing it 
would *still* imply most of the negative effects of mass cache eviction. Any 
large cf with a significant warm-up period would be highly effected.

A possible way to categorize a cf might be:

(1) Very small cf; fits in RAM with lots of margin.
(2) Smallish, just barely fits in RAM.
(3) Large; a lot larger than RAM.

On the premise that we're discussing situations where cache warmth is relevant 
the following disposition of the above cf:s with respect to an incremental 
switch-over:

(1) Works, but doesn't matter much since it fits in RAM anyway (except for 
muliple such sstables, but then see (2))
(2) Here we improve significantly by allowing us to lower the constant factor 
of RAM required relative to domain data size.
(3) Doesn't work anyway due to eviction on writes.

So really, it seems to me that for situations where you need a reasonably high 
rate of compaction, it would only work very well in (2) which is sort of a 
special case sitting in the middle on a spectrum.

You do point out that slow compaction is a potential helper here, and I agree. 
Provided that compaction is sufficiently slow that the warm-up period of the 
node is similar or less to the time spent compacting, this would indeed work 
well even in the case of (3).

I would further suggest that if you are IOPS sensitive you probably have a 
strong desire to limit compaction rate to something reasonable anyway.

It's not clear to me whether the trade-offs would tend to land on the side of 
it working well in practice or not.

A reasonably realistic example of type (3) with concrete numbers (let me know 
if I'm taking a mis-step in the calculations):

(I am about to engage on some pretty speculative stuff that terminates with 
insufficient math skills on my part; you may want to just skip the remainder of 
this comment.)

Say you have a 500 GB CF, and 16 GB of page cache in the OS. Say you have a 
warm-up period of 30 minutes on a completely cold start before you're 
comfortable taking the load. Assume that you don't want more than a ~ 25% 
impact in terms of cold IOPS during a compaction, relative to the level of 
warmness you reach after your 30 minute warm-up on node start.

Eviction will tend to be random relative to the frequency/recency of access. So 
an instant eviction of some percentage of page cache should result in a 
proportional (by some factor) percentage of IOPS.

Assume that your workload is such that 90% of reads are served from cache. This 
should mean that the factor in question is 10. I.e., a 10% eviction should 
result in a 100% increase in IOPS.

Now, if over time cache hit rates increased linearly this would mean that a 25% 
target IOPS increase during compaction translates into a 2.5% maximum eviction 
rate over the 30 minute time window. But here is where we become dependent on 
the distribution of reads and unfortunately where my math skills fail me.

But at least in the worst possible (unrealistic) case, those 2.5% over 30 
minutes translates, with a 16 GB page cache, into 400 MB/30 minutes. Compacting 
500 GB would thus take 26 days. Of course this is utterly unrealistic but 
should be an upper bound. Anyone with more math skills want to chime in on the 
expected behavior given a long-tail distribution (for example) where 30 minutes 
translates into the 90% hit rate?


> support incremental sstable switching
> -------------------------------------
>
>                 Key: CASSANDRA-1658
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1658
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Peter Schuller
>            Priority: Minor
>
> I have been thinking about how to minimize the impact of compaction further 
> beyond CASSANDRA-1470. 1470 deals with the impact of the compaction process 
> itself in that it avoids going through the buffer cache; however, once 
> compaction is complete you are still switching to new sstables which will 
> imply cold reads.
> Instead of switching all at once, one could keep both the old and new 
> sstables around for a bit and incrementally switch over traffic to the new 
> sstables.
> A given request would go to the new or old sstable depending on e.g. the hash 
> of the row key couple with the point in time relative to compaction 
> completion and relative to the intended target sstable switch-over.
> In terms of end-user configuration/mnemonics, one would specify, for a given 
> column family, something like "sstable transition period per gb of data" or 
> similar. The "per gb of data" would refer to the size of the newly written 
> sstable after a compaction. So; for a major compaction you would wait for a 
> very significant period of time since the entire database just went cold. For 
> a minor compaction, you would only wait for a short period of time.
> The result should be a reasonable negative impact on e.g. disk space usage, 
> but hopefully a very significant impact in terms of making the sstable 
> transition as smooth as possible for the node.
> I like this because it feels pretty simple, is not relying on OS specific 
> features or otherwise rely on specific support from the OS other than a "well 
> functioning cache mechanism", and does not imply something hugely significant 
> like writing our own page cache layer. The performance w.r.t. CPU should be 
> very small, but the improvement in terms of disk I/O should be very 
> significant for workloads where it matters.
> The feature would be optional and per-sstable (or possibly global for the 
> node).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to