[
https://issues.apache.org/jira/browse/CASSANDRA-6746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13933089#comment-13933089
]
Benedict commented on CASSANDRA-6746:
-------------------------------------
bq. I get that, but the reason this was introduced in the first place was
because the default behavior of (at least some) Linux kernels was to evict
other data in favor of the newly flushed.
Do you have a reference for that discussion? I couldn't find it searching JIRA.
Whilst my research can't rule this out as a possibility, it seems as though it
would be unlikely. The age of the recently written data would be low, certainly
lower than any hot data, so that once it is actually synced to disk it is
likely to be in the inactive_clean list and free for reclaim.
It's possible that the non-trickle-fsync default interacts badly with this,
with us permitting the entire sstable to hit the page cache and evict
everything else whilst the OS catches up. But without that scenario I would be
really surprised to see this behaviour of keeping written-once pages over hotly
read data.
Either way, in the scenario where we are compacting hot data (probably the more
likely one, since the amount of compaction performed on data should decline
with age, so we'll mostly be compacting younger data), the current behaviour is
the worst possible, with 2.6 kernels (apparently still going strong, and
possibly later ones too) definitely trashing the hot cache. So unless we detect
the kernel version and set the default based on the known better behaviour of
DONTNEED, this seems to me the better default. But we could perhaps change the
defaults for trickle fsync as well (say, enable it with a 100MB interval by
default) so that the OS has plenty of opportunity to reclaim the pages we're
writing if it needs to.
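To make the trickle fsync idea concrete, here is a minimal sketch in Python. It
is not Cassandra's implementation: the function name, its parameters and the
chunked-write loop are all hypothetical, chosen only to illustrate syncing every
N bytes so that dirty pages never pile up to the size of a whole sstable.

```python
import os
import tempfile


def write_with_trickle_fsync(path, chunks, sync_interval_bytes):
    """Write chunks to path, fsyncing whenever sync_interval_bytes of
    unsynced data has accumulated. Returns the number of fsync calls.

    Hypothetical helper for illustration only.
    """
    syncs = 0
    unsynced = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
            unsynced += len(chunk)
            if unsynced >= sync_interval_bytes:
                # Push buffered bytes to the OS, then force them to disk so
                # the pages become clean and cheap for the kernel to reclaim.
                f.flush()
                os.fsync(f.fileno())
                syncs += 1
                unsynced = 0
        # Final sync covers whatever is left after the last interval.
        f.flush()
        os.fsync(f.fileno())
        syncs += 1
    return syncs


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "example-Data.db")
        # 100MB interval, as floated above; the data here is tiny on purpose.
        count = write_with_trickle_fsync(
            target, [b"x" * 4096] * 8, 100 * 1024 * 1024
        )
        print("fsync calls:", count)
```

With a small interval the writer syncs often and keeps few dirty pages
outstanding; with a large one it approaches the single sync at the end that the
non-trickle default amounts to.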
> Reads have a slow ramp up in speed
> ----------------------------------
>
> Key: CASSANDRA-6746
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6746
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Reporter: Ryan McGuire
> Assignee: Benedict
> Labels: performance
> Fix For: 2.1 beta2
>
> Attachments: 2.1_vs_2.0_read.png, 6746-patched.png, 6746.txt,
> cassandra-2.0-bdplab-trial-fincore.tar.bz2,
> cassandra-2.1-bdplab-trial-fincore.tar.bz2
>
>
> On a physical four-node cluster I am doing a big write and then a big read.
> The read takes a long time to ramp up to respectable speeds.
> !2.1_vs_2.0_read.png!
> [See data
> here|http://ryanmcguire.info/ds/graph/graph.html?stats=stats.2.1_vs_2.0_vs_1.2.retry1.json&metric=interval_op_rate&operation=stress-read&smoothing=1]