[ https://issues.apache.org/jira/browse/CASSANDRA-1882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12975647#action_12975647 ]

Peter Schuller commented on CASSANDRA-1882:
-------------------------------------------

(First: I haven't done further work yet because I'm away traveling and not 
really doing development.)

Jake: Thanks. However, I'm pretty skeptical, as I/O niceness only gives a very 
coarse way of specifying what you want. So even if it works beautifully in 
some particular case, it won't in others, and there is no good way to control 
it AFAIK.

For example, the very first test I did (writing at a fixed rate and fixed 
chunk size concurrently with seek-bound small reads) failed miserably by 
completely starving the writes (and this was *without* ionice) until I 
switched away from cfq to noop or deadline, because cfq refused to actually 
submit I/O requests to the device, preventing it from doing its own scheduling 
based on better information (more on that in a future comment). The support 
for I/O niceness is specific to cfq, btw.

I don't want to get into too many specifics yet, because I want to do some 
more testing and try a bit harder to make cfq do what I want before I start 
making claims. But I think that, in general, rate limiting I/O in such a way 
that you get sufficient throughput while not having too adverse an effect on 
foreground reads is going to take some runtime tuning depending on both 
workload and hardware (e.g., a lone disk and a 6-disk RAID10 are entirely 
different matters). I think that simply telling the kernel to de-prioritize 
the compaction workload might work well in some very specific situations 
(exactly the right kernel version, I/O scheduler choice/parameters, workload 
and underlying storage device), but not in general.

More to come. Hopefully with some Python code + sysbench command lines for 
easy testing by others on differing hardware setups. (I have not yet tested 
with a real rate-limited Cassandra, but I did test with sysbench for reads and 
a Python writer doing chunk-size I/O with fsync(). Tests were done on 
raid5/raid10 and with xfs and ext4, though not all permutations. While file 
system choice has some impact, all results instantly became useless once I 
realized the I/O scheduling was orders of magnitude more important.)
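
For reference, here is roughly what such a writer can look like; a minimal 
sketch from memory, with the chunk size and target rate as illustrative 
placeholders rather than the exact values I ran (the seek-bound reads came 
from sysbench running against the same device):

    import os
    import time

    CHUNK_SIZE = 1024 * 1024        # fixed chunk size (1 MB here, illustrative)
    TARGET_RATE = 20 * 1024 * 1024  # fixed write rate in bytes/sec (illustrative)

    def rate_limited_writer(path, total_bytes):
        """Write total_bytes in CHUNK_SIZE pieces, fsync()ing every chunk
        and sleeping as needed to hold the target rate."""
        buf = b'\0' * CHUNK_SIZE
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
        start = time.time()
        written = 0
        try:
            while written < total_bytes:
                os.write(fd, buf)
                os.fsync(fd)  # force the data out; defeats write-back deferral
                written += CHUNK_SIZE
                # sleep off any amount by which we are ahead of schedule
                ahead = written / float(TARGET_RATE) - (time.time() - start)
                if ahead > 0:
                    time.sleep(ahead)
        finally:
            os.close(fd)

    if __name__ == '__main__':
        rate_limited_writer('/tmp/bgwriter.dat', 1024 * 1024 * 1024)  # 1 GB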


> rate limit all background I/O
> -----------------------------
>
>                 Key: CASSANDRA-1882
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1882
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Peter Schuller
>            Assignee: Peter Schuller
>            Priority: Minor
>             Fix For: 0.7.1
>
>
> There is a clear need to support rate limiting of all background I/O (e.g., 
> compaction, repair). In some cases background I/O is naturally rate limited 
> as a result of being CPU bottlenecked, but in all cases where the CPU is not 
> the bottleneck, background streaming I/O is almost guaranteed (barring a very 
> very smart RAID controller or I/O subsystem that happens to cater extremely 
> well to the use case) to be detrimental to the latency and throughput of 
> regular live traffic (reads).
> Ways in which live traffic is negatively affected by background I/O include:
> * Indirectly by page cache eviction (see e.g. CASSANDRA-1470).
> * Background reads are directly detrimental for the usual reasons when not 
> otherwise limited; large streaming read requests that keep coming battle 
> with latency-sensitive live traffic (mostly seek-bound). Mixing seek-bound 
> latency-critical traffic with bulk streaming is a classic no-no for I/O 
> scheduling.
> * Writes are directly detrimental in a similar fashion.
> * But in particular, writes are more difficult still: caching tends to 
> amplify the effects because, lacking any kind of fsync() or direct I/O, the 
> operating system and/or RAID controller tends to defer writes when possible. 
> This often leads to very sudden throttling of the application once caches 
> fill up, at which point there is potentially a huge backlog of data to 
> write.
> ** This may evict a lot of data from page cache since dirty buffers cannot be 
> evicted prior to being flushed out (though CASSANDRA-1470 and related will 
> hopefully help here).
> ** In particular, one major reason why battery-backed RAID controllers are 
> great is that they have the capability to "eat" storms of writes very 
> quickly and schedule them fairly efficiently with respect to a concurrent 
> continuous stream of reads. But this ability is defeated if we just throw 
> data at the controller until it is entirely full. With a rate-limited 
> approach, data can instead be fed to said RAID controller at a reasonable 
> pace, allowing it to do its job of limiting the impact of those writes on 
> reads.
> I propose a mechanism whereby all such background I/O is rate limited in 
> terms of MB/sec throughput. There would be:
> * A configuration option to state the target rate (probably a global, until 
> there is support for per-cf sstable placement)
> * A configuration option to state the sampling granularity. The granularity 
> would have to be small enough for rate limiting to be effective (i.e., the 
> amount of I/O generated in between samples must be reasonably small), while 
> being large enough not to be expensive (neither in terms of gettimeofday()- 
> style overhead, nor in terms of causing smaller writes such that would-be 
> streaming operations become seek-bound). There would likely be a recommended 
> value on the order of, say, 5 MB, with a recommendation to multiply that by 
> the number of disks in the underlying device (5 MB assumes classic 
> mechanical disks).
> Because of the coarse granularity (= infrequent synchronization), there 
> should not be significant overhead associated with maintaining a shared 
> global rate limiter for the Cassandra instance.
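> To make the mechanism concrete, here is a minimal sketch of such a shared 
> limiter (Python for brevity; the real thing would be Java inside Cassandra, 
> and all names here are hypothetical):
>
>     import threading
>     import time
>
>     class GlobalRateLimiter(object):
>         """Background I/O calls acquire(n) for every n bytes it is about
>         to read or write. The clock is consulted only once per
>         `granularity` bytes, which keeps synchronization cheap."""
>
>         def __init__(self, target_mb_per_sec, granularity=5 * 1024 * 1024):
>             self.rate = target_mb_per_sec * 1024.0 * 1024.0  # bytes/sec
>             self.granularity = granularity                   # ~5 MB per sample
>             self.lock = threading.Lock()
>             self.start = time.time()
>             self.accounted = 0  # bytes already counted against the rate
>             self.pending = 0    # bytes not yet folded into the accounting
>
>         def acquire(self, nbytes):
>             with self.lock:
>                 self.pending += nbytes
>                 if self.pending < self.granularity:
>                     return  # cheap path: no clock check
>                 self.accounted += self.pending
>                 self.pending = 0
>                 ahead = self.accounted / self.rate - (time.time() - self.start)
>             if ahead > 0:
>                 time.sleep(ahead)  # ahead of schedule; back off
>
> (A real implementation would also want to forget history across idle 
> periods, so that a long-idle limiter does not permit an unbounded catch-up 
> burst.)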

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
