rate limit all background I/O
-----------------------------
Key: CASSANDRA-1882
URL: https://issues.apache.org/jira/browse/CASSANDRA-1882
Project: Cassandra
Issue Type: Improvement
Components: Core
Reporter: Peter Schuller
Priority: Minor
There is a clear need to support rate limiting of all background I/O (e.g.,
compaction, repair). In some cases background I/O is naturally rate limited as
a result of being CPU bottlenecked, but in all cases where the CPU is not the
bottleneck, background streaming I/O is almost guaranteed (barring a very
smart RAID controller or I/O subsystem that happens to cater extremely well to
the use case) to be detrimental to the latency and throughput of regular live
traffic (reads).
Ways in which live traffic is negatively affected by background I/O include:
* Indirectly by page cache eviction (see e.g. CASSANDRA-1470).
* Reads are directly detrimental when not otherwise limited for the usual
reasons; large, continuous read requests compete with latency-sensitive live
traffic (mostly seek bound). Mixing seek-bound, latency-critical traffic with
bulk streaming is a classic no-no for I/O scheduling.
* Writes are directly detrimental in a similar fashion.
* Writes are more difficult still: caching tends to amplify the impact because,
lacking any kind of fsync() or direct I/O, the operating system and/or RAID
controller tends to defer writes when possible. This often leads to a very
sudden throttling of the application when caches are filled, at which point
there is potentially a huge backlog of data to write.
** This may evict a lot of data from page cache since dirty buffers cannot be
evicted prior to being flushed out (though CASSANDRA-1470 and related will
hopefully help here).
** In particular, one major reason why battery-backed RAID controllers are great
is that they have the capability to "eat" storms of writes very quickly and
schedule them pretty efficiently with respect to a concurrent continuous stream
of reads. But this ability is defeated if we just throw data at the controller
until its cache is entirely full. A rate-limited approach instead means that
data can be thrown at said RAID controller at a reasonable pace, so that it can
do its job of limiting the impact of those writes on reads.
I propose a mechanism whereby all such background I/O is rate limited in
terms of MB/sec throughput. There would be:
* A configuration option to state the target rate (probably a global, until
there is support for per-cf sstable placement)
* A configuration option to state the sampling granularity. The granularity
would have to be small enough for rate limiting to be effective (i.e., the
amount of I/O generated in between each sample must be reasonably small) while
large enough to not be expensive (neither in terms of gettimeofday() type
overhead, nor in terms of causing smaller writes so that would-be streaming
operations become seek bound). There would likely be a recommended value on the
order of say 5 MB, with a recommendation to multiply that with the number of
disks in the underlying device (5 MB assumes classic mechanical disks).
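To make the two options concrete, here is a minimal sketch in Python (not
Cassandra code). The names Throttle, account and recommended_granularity are
hypothetical and chosen only for illustration. The limiter looks at the clock
only once per "granularity" bytes, then sleeps if that chunk finished faster
than the target rate allows.

```python
import time

MB = 1024 * 1024


class Throttle:
    """Hypothetical limiter: samples the clock once per `granularity` bytes."""

    def __init__(self, target_bytes_per_sec, granularity_bytes,
                 clock=time.monotonic, sleep=time.sleep):
        self.rate = target_bytes_per_sec
        self.granularity = granularity_bytes
        self.clock = clock            # injectable so the sketch can be tested
        self.sleep = sleep
        self.pending = 0              # bytes accounted since the last sample
        self.last_sample = clock()

    def account(self, nbytes):
        """Record nbytes of background I/O; return the seconds slept."""
        self.pending += nbytes
        if self.pending < self.granularity:
            return 0.0                # cheap path: no clock call at all
        # A chunk of `pending` bytes should take at least pending / rate seconds.
        min_duration = self.pending / self.rate
        delay = min_duration - (self.clock() - self.last_sample)
        self.pending = 0
        if delay > 0:
            self.sleep(delay)
        self.last_sample = self.clock()
        return max(delay, 0.0)


def recommended_granularity(num_disks, per_disk_bytes=5 * MB):
    """5 MB per classic mechanical disk, multiplied by the number of disks."""
    return num_disks * per_disk_bytes
```

A compaction or repair loop would call account(len(buf)) after each read or
write. With a 10 MB/sec target and 5 MB granularity, the clock is consulted
roughly twice per second regardless of how small the individual buffers are.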
Because of coarse granularity (= infrequent synchronization), there should not
be a significant overhead associated with maintaining a shared global rate
limiter for the Cassandra instance.
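One possible shape for such a shared limiter, again as a hedged Python sketch
rather than a proposed implementation: SharedThrottle, TaskThrottle and
sync_count are hypothetical names. Each background task counts bytes locally
and takes the global lock only once per granularity, so the number of
synchronizations is roughly total bytes divided by granularity.

```python
import threading
import time

MB = 1024 * 1024


class SharedThrottle:
    """Hypothetical instance-wide limiter shared by all background tasks."""

    def __init__(self, bytes_per_sec, clock=time.monotonic, sleep=time.sleep):
        self.rate = bytes_per_sec
        self.clock = clock
        self.sleep = sleep
        self.lock = threading.Lock()
        self.next_free = clock()      # earliest time the next chunk may proceed
        self.sync_count = 0           # how often the lock was taken (for illustration)

    def acquire(self, nbytes):
        """Reserve time for nbytes; sleep until earlier reservations expire."""
        with self.lock:
            self.sync_count += 1
            now = self.clock()
            start = max(now, self.next_free)
            self.next_free = start + nbytes / self.rate
            wait = start - now
        if wait > 0:
            self.sleep(wait)          # sleep outside the lock
        return wait


class TaskThrottle:
    """Per-task view: local byte counting, one global sync per granularity."""

    def __init__(self, shared, granularity_bytes):
        self.shared = shared
        self.granularity = granularity_bytes
        self.pending = 0

    def account(self, nbytes):
        self.pending += nbytes
        if self.pending >= self.granularity:
            chunk, self.pending = self.pending, 0
            self.shared.acquire(chunk)
```

With 64 KB buffers and a 5 MB granularity, each task touches the shared lock
once per 80 buffers, which is the sense in which coarse granularity keeps the
cost of a global limiter low.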