[
https://issues.apache.org/jira/browse/CASSANDRA-17237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Brandon Williams updated CASSANDRA-17237:
-----------------------------------------
Workflow: Copy of Cassandra Default Workflow (was: Copy of Cassandra Bug
Workflow)
Issue Type: Improvement (was: Bug)
> Pathological interaction between Cassandra and readahead, particularly on
> CentOS 7 VMs
> --------------------------------------------------------------------------------------
>
> Key: CASSANDRA-17237
> URL: https://issues.apache.org/jira/browse/CASSANDRA-17237
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Daniel Cranford
> Priority: Normal
>
> Cassandra defaults to using mmap for IO, except on 32-bit systems. The config
> value `disk_access_mode` that controls this is neither included nor
> documented in cassandra.yaml.
> While this may be a reasonable default config for Cassandra, we've noticed a
> pathological interplay between the way Linux implements readahead for mmap
> and Cassandra's IO patterns, particularly on vanilla CentOS 7 VMs.
> A read that misses all levels of cache in Cassandra is (typically) going to
> involve 2 IOs: one into the index file and one into the data file. These IOs
> will both be effectively random given the nature of the Murmur3 hash
> partitioner. The amount of data read from the index file IO will be
> relatively small, perhaps 4-8kb, compared to the data file IO which (assuming
> the entire partition fits in a single compressed chunk and a compression
> ratio of 1/2) will require 32kb.
> However, applications using `mmap()` have no way to tell the OS the desired
> IO size - they can only tell the OS the desired IO location - by reading from
> the mapped address and triggering a page fault. This is unlike `read()` where
> the application provides both the size and location to the OS. So for
> `mmap()` the OS has to guess how large the IO submitted to the backing device
> should be and whether the application is performing sequential or random IO
> unless the application provides hints (eg `fadvise()`, `madvise()`,
> `readahead()`).
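
To make the difference above concrete, here is a minimal sketch in Python (not Cassandra's Java code) contrasting the two access styles. The helper names `read_with_pread`, `read_with_mmap`, `advise_random` and `demo` are hypothetical; the hint calls are best-effort and are skipped on platforms that lack them.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE


def read_with_pread(fd, offset, size):
    # Standard IO: the caller tells the OS both the location and the size.
    return os.pread(fd, size, offset)


def read_with_mmap(mapped, offset, size):
    # mmap IO: copying out of the mapping touches pages; the kernel only
    # learns the faulting address, never the size the caller wanted.
    return mapped[offset:offset + size]


def advise_random(fd, mapped):
    """Best-effort hints that access will be random; returns the hints applied."""
    applied = []
    if hasattr(os, "posix_fadvise"):
        try:
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_RANDOM)
            applied.append("posix_fadvise")
        except OSError:
            pass  # some filesystems may reject the hint
    if hasattr(mapped, "madvise") and hasattr(mmap, "MADV_RANDOM"):
        try:
            mapped.madvise(mmap.MADV_RANDOM)
            applied.append("madvise")
        except OSError:
            pass
    return applied


def demo():
    """Read the same bytes both ways from a scratch file and compare them."""
    with tempfile.NamedTemporaryFile() as f:
        f.write(os.urandom(16 * PAGE))
        f.flush()
        fd = f.fileno()
        with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as mapped:
            advise_random(fd, mapped)
            offset, size = 5 * PAGE + 100, 300
            return read_with_pread(fd, offset, size) == read_with_mmap(mapped, offset, size)


if __name__ == "__main__":
    print("same bytes via pread and mmap:", demo())
```

Both paths return the same bytes. What differs is how much the kernel knows when it has to decide the size of the IO sent to the device.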
> This is how Linux determines the size of IO for mmap during a page fault:
> * Absent hints (e.g. FADV_RANDOM), the default IO size is the maximum
> readahead value, with the faulting address in the middle of the IO, i.e. the
> IO is requested for the range
> [fault_addr - max_readahead / 2, fault_addr + max_readahead / 2]. This is
> sometimes referred to as "read around" (i.e. read around the faulting
> address). See
> [here|https://github.com/torvalds/linux/blob/0c941cf30b913d4a684d3f8d9eee60e0daffdc79/mm/filemap.c#L2989]
> * The kernel maintains a cache miss counter for the file. Every time the
> kernel submits an IO for a page fault, this counts as a miss and increments
> the counter. Every time the application faults in a page that is already in
> the page cache (presumably from a previous page fault's IO), this counts as a
> hit and decrements the counter. If the miss counter exceeds a threshold, the
> kernel stops inflating the IOs to the max readahead and falls back to reading
> a *single* 4k page for each page fault. See summary
> [here|https://www.quora.com/What-heuristics-does-the-adaptive-readahead-implementation-in-the-Linux-kernel-use/answer/Robert-Love-1]
> and implementation
> [here|https://github.com/torvalds/linux/blob/0c941cf30b913d4a684d3f8d9eee60e0daffdc79/mm/filemap.c#L2955]
> and
> [here|https://github.com/torvalds/linux/blob/0c941cf30b913d4a684d3f8d9eee60e0daffdc79/mm/filemap.c#L3005]
> * This means an application that, on average, references more than one 4k
> page around the initial page fault will consistently have page fault IOs
> inflated to the maximum readahead value. Note that there is no ramping up of
> a window the way there is with standard IO. As far as I can tell, the kernel
> only submits IOs of either 1 page or max_readahead.
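
The three points above can be sketched as a toy model. This is not kernel code, only a simplified rendering of the description: all names are hypothetical, and `MISS_THRESHOLD = 100` is an assumption standing in for the kernel's threshold.

```python
from dataclasses import dataclass

MISS_THRESHOLD = 100  # assumption: stand-in for the kernel's miss threshold


def read_around_window(fault_page, max_readahead_pages):
    """Page range [start, end) read for a fault, centred on the faulting page."""
    start = max(0, fault_page - max_readahead_pages // 2)
    return start, start + max_readahead_pages


@dataclass
class MmapReadaheadModel:
    max_readahead_pages: int
    miss_threshold: int = MISS_THRESHOLD
    misses: int = 0

    def fault_miss(self):
        """Fault on a page not in the page cache; returns the IO size in pages."""
        self.misses += 1
        if self.misses > self.miss_threshold:
            return 1  # too many misses: fall back to a single page
        return self.max_readahead_pages  # otherwise the full read-around window

    def fault_hit(self):
        """Fault on a page already in the page cache."""
        if self.misses > 0:
            self.misses -= 1


def simulate(pages_per_read, reads, max_readahead_pages=32):
    """IO sizes (in pages) for random reads that each touch `pages_per_read` pages."""
    model = MmapReadaheadModel(max_readahead_pages)
    sizes = []
    for _ in range(reads):
        io = model.fault_miss()
        sizes.append(io)
        for _ in range(pages_per_read - 1):
            if io > 1:
                model.fault_hit()  # neighbour was brought in by the read-around
            else:
                sizes.append(model.fault_miss())  # neighbour needs its own IO
    return sizes


if __name__ == "__main__":
    print("window for fault at page 100:", read_around_window(100, 32))
    print("1 page per read, last IO size:", simulate(1, 150)[-1])
    print("2 pages per read, IO sizes seen:", set(simulate(2, 150)))
```

In this model, a workload touching a single page per read eventually drops to one-page IOs, while a workload touching two or more pages per read keeps the miss counter near zero and never leaves the max readahead size.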
> Observations:
> * mmap'ed IO on Linux wastes half the IO bandwidth, since half of each
> read-around window lies before the faulting address. This may or may not be a
> big deal depending on your setup.
> * Cassandra will always have IOs inflated to the maximum readahead because
> more than 1 page is referenced for the data file and (depending on the size
> and cardinality of your keys) more than one page is referenced from the index
> file.
> * The device's readahead is a crude system wide knob for controlling IO size.
> Cassandra cannot perform smaller IOs for the index file (unless your keyset
> is such that only 1 page from the index file needs to be referenced).
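
A rough back-of-the-envelope sketch of the observations above, using the sizes mentioned in this report (4-8kb index reads, 32kb data reads, readahead of 64kb, 128kb and roughly 450kb). It assumes the wanted bytes start at the faulting address and run forward, and it ignores page alignment; the function names are hypothetical.

```python
def read_around_useful_kib(needed_kib, readahead_kib):
    """KiB of the read-around window that the read actually wants.

    Assumption: the wanted range starts at the faulting address, so only the
    forward half of the window can be useful.
    """
    return min(needed_kib, readahead_kib / 2)


def useful_fraction(needed_kib, readahead_kib):
    """Fraction of the submitted IO that the read actually wanted."""
    return read_around_useful_kib(needed_kib, readahead_kib) / readahead_kib


if __name__ == "__main__":
    for readahead in (64, 128, 450):
        for label, needed in (("index", 8), ("data", 32)):
            frac = useful_fraction(needed, readahead)
            print(f"readahead={readahead:>3}kb {label:>5} read of {needed:>2}kb: "
                  f"{frac:.1%} of the IO is useful")
```

Under these assumptions the useful share never exceeds one half, and it shrinks quickly as the readahead setting grows relative to the read.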
> CentOS 7 VMs:
> * The default readahead for CentOS 7 VMs is 4MB (as opposed to the default
> readahead for non-VM CentOS 7, which is 128kb).
> * Even though this is reduced by the kernel (cf `max_sane_readahead()`) to
> something around 450k, it is still far too large for an average Cassandra
> read.
> * Even once this readahead is reduced to the recommended 64kb, standard IO
> still has a 10% performance advantage in our tests, likely because the
> readahead algorithm for standard IO is more flexible and converges on smaller
> reads from the index file and larger reads from the data file.
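
For anyone wanting to check their own hosts, here is a small sketch that reads the per-device readahead setting from sysfs (`/sys/block/<dev>/queue/read_ahead_kb`) and flags devices above the 64kb figure mentioned above. The function names and the `limit_kb` parameter are hypothetical.

```python
from pathlib import Path


def read_ahead_settings(sysfs_block="/sys/block"):
    """Map each block device name to its read_ahead_kb value, where readable."""
    settings = {}
    root = Path(sysfs_block)
    if not root.is_dir():
        return settings  # not Linux, or sysfs not mounted
    for dev in sorted(root.iterdir()):
        knob = dev / "queue" / "read_ahead_kb"
        try:
            settings[dev.name] = int(knob.read_text().strip())
        except (OSError, ValueError):
            continue  # device without a readable readahead knob
    return settings


def oversized(settings, limit_kb=64):
    """Devices whose readahead exceeds limit_kb."""
    return {dev: kb for dev, kb in settings.items() if kb > limit_kb}


if __name__ == "__main__":
    current = read_ahead_settings()
    for dev, kb in current.items():
        print(f"{dev}: read_ahead_kb={kb}")
    print("above 64kb:", oversized(current))
```

This only reports the setting. Changing it (for example with `blockdev --setra`, which takes a count of 512-byte sectors) affects every consumer of the device, which is the "crude system wide knob" problem noted above.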