rate limit all background I/O
-----------------------------

                 Key: CASSANDRA-1882
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1882
             Project: Cassandra
          Issue Type: Improvement
          Components: Core
            Reporter: Peter Schuller
            Priority: Minor


There is a clear need to support rate limiting of all background I/O (e.g., 
compaction, repair). In some cases background I/O is naturally rate limited as 
a result of being CPU bottlenecked. But whenever the CPU is not the bottleneck, 
background streaming I/O is almost guaranteed to be detrimental to the latency 
and throughput of regular live traffic (reads), barring a very smart RAID 
controller or I/O subsystem that happens to cater extremely well to the use 
case.

Ways in which live traffic is negatively affected by background I/O include:

* Indirectly by page cache eviction (see e.g. CASSANDRA-1470).
* Background reads are directly detrimental when not otherwise limited: a 
continuous stream of large read requests competes with latency-sensitive live 
traffic, which is mostly seek bound. Mixing seek-bound, latency-critical I/O 
with bulk streaming is a classic no-no for I/O scheduling.
* Writes are directly detrimental in a similar fashion.
* But in particular, writes are more difficult still: caching tends to amplify 
the impact because, lacking any kind of fsync() or direct I/O, the operating 
system and/or RAID controller tends to defer writes when possible. This often 
leads to a very sudden throttling of the application when the caches fill up, 
at which point there is potentially a huge backlog of data to write.
** This may evict a lot of data from page cache since dirty buffers cannot be 
evicted prior to being flushed out (though CASSANDRA-1470 and related will 
hopefully help here).
** In particular, one major reason why battery-backed RAID controllers are great 
is that they have the capability to "eat" storms of writes very quickly and 
schedule them pretty efficiently with respect to a concurrent continuous stream 
of reads. But this ability is defeated if we just throw data at it until it is 
entirely full. Instead, a rate-limited approach means that data can be thrown at 
said RAID controller at a reasonable pace and it can be allowed to do its job 
of limiting the impact of those writes on reads.

I propose a mechanism whereby all such background I/O is rate limited in 
terms of MB/sec throughput. There would be:

* A configuration option to state the target rate (probably a global, until 
there is support for per-cf sstable placement)
* A configuration option to state the sampling granularity. The granularity 
would have to be small enough for rate limiting to be effective (i.e., the 
amount of I/O generated in between each sample must be reasonably small) while 
large enough to not be expensive (neither in terms of gettimeofday() type 
overhead, nor in terms of causing smaller writes so that would-be streaming 
operations become seek bound). There would likely be a recommended value on the 
order of say 5 MB, with a recommendation to multiply that by the number of 
disks in the underlying device (5 MB assumes classic mechanical disks).

Because of the coarse granularity (= infrequent synchronization), there should 
not be a significant overhead associated with maintaining a shared global rate 
limiter for the Cassandra instance.
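
To make the proposal concrete, here is a minimal sketch of what such a shared 
limiter could look like. It is only an illustration under stated assumptions, 
not existing Cassandra code: the class name BackgroundIoThrottle, its methods, 
and the injected clock are all hypothetical. Callers report bytes after each 
piece of background I/O; the clock is only consulted, and the caller only 
paused, once a full granule of bytes has accumulated.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

/** Hypothetical shared throttle for background I/O (illustration only). */
final class BackgroundIoThrottle {
    private final long bytesPerSecond;    // configured target rate
    private final long granularityBytes;  // configured sampling granularity
    private final LongSupplier nanoClock; // e.g. System::nanoTime; injectable for tests
    private long pendingBytes;            // bytes reported since the last clock sample
    private long nextFreeNanos;           // time at which the booked bytes are "paid for"

    BackgroundIoThrottle(long bytesPerSecond, long granularityBytes, LongSupplier nanoClock) {
        if (bytesPerSecond <= 0 || granularityBytes <= 0) {
            throw new IllegalArgumentException("rate and granularity must be positive");
        }
        this.bytesPerSecond = bytesPerSecond;
        this.granularityBytes = granularityBytes;
        this.nanoClock = nanoClock;
        this.nextFreeNanos = nanoClock.getAsLong();
    }

    /**
     * Books bytes that were just read or written and returns how many
     * nanoseconds the caller should pause. Returns 0 without sampling the
     * clock while fewer than granularityBytes have accumulated.
     */
    synchronized long reserve(long bytes) {
        pendingBytes += bytes;
        if (pendingBytes < granularityBytes) {
            return 0L;
        }
        long now = nanoClock.getAsLong();
        long costNanos = (long) (pendingBytes * 1e9 / bytesPerSecond);
        pendingBytes = 0;
        nextFreeNanos += costNanos;
        if (nextFreeNanos - now < 0) {
            // The limiter was idle: do not let unused budget turn into a burst.
            nextFreeNanos = now;
        }
        return nextFreeNanos - now;
    }

    /** Blocking variant: sleeps outside the lock so other threads are not held up. */
    void acquire(long bytes) throws InterruptedException {
        long waitNanos = reserve(bytes);
        if (waitNanos > 0) {
            TimeUnit.NANOSECONDS.sleep(waitNanos);
        }
    }
}
```

A compaction or repair thread would call acquire(n) after each chunk of n 
bytes. With a target of 10 MB/sec and the suggested 5 MB granularity, the 
shared lock is contended and the clock sampled roughly twice per second, 
which is the reason the overhead should be small. On a four-disk device the 
recommendation above would give a granularity of 20 MB.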

