[ 
https://issues.apache.org/jira/browse/CASSANDRA-6746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13933089#comment-13933089
 ] 

Benedict commented on CASSANDRA-6746:
-------------------------------------

bq. I get that, but the reason this was introduced in the first place was 
because the default behavior of (at least some) Linux kernels was to evict 
other data in favor of the newly flushed.

Do you have a reference for that discussion? I couldn't find it searching JIRA.

Whilst my research can't rule this out as a possibility, it seems unlikely. 
The age of the recently written data would be low, certainly lower than that 
of any hot data, so that once it is actually synced to disk it is likely to be 
on the inactive_clean list and free for reclaim.

It's possible that the non-trickle-fsync default interacts badly with this: we 
permit the entire sstable to hit the page cache and evict everything else 
whilst the OS catches up. But outside that scenario I would be really surprised 
to see this behaviour of keeping written-once pages over hotly read data.
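
To make the trickle-fsync idea concrete, here is a minimal sketch of a writer 
that forces dirty pages to disk every fixed number of bytes instead of letting 
a whole file accumulate in the page cache. It is not Cassandra's actual writer; 
the class name, method name and parameters are hypothetical, and the sizes in 
main() are arbitrary.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class TrickleFsyncSketch {
    /**
     * Writes data to path in chunks and calls force() each time at least
     * intervalBytes have been written since the last sync.
     * Returns the number of syncs performed. (Hypothetical helper.)
     */
    static int writeWithTrickleSync(Path path, byte[] data, int chunkBytes, long intervalBytes)
            throws IOException {
        int syncs = 0;
        long sinceSync = 0;
        try (FileChannel ch = FileChannel.open(path,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            for (int off = 0; off < data.length; off += chunkBytes) {
                int len = Math.min(chunkBytes, data.length - off);
                ByteBuffer buf = ByteBuffer.wrap(data, off, len);
                while (buf.hasRemaining()) {
                    ch.write(buf);
                }
                sinceSync += len;
                if (sinceSync >= intervalBytes) {
                    // Flush file contents (not metadata) so the pages become clean
                    // and the OS is free to reclaim them.
                    ch.force(false);
                    syncs++;
                    sinceSync = 0;
                }
            }
            if (sinceSync > 0) {
                // Sync whatever is left after the last full interval.
                ch.force(false);
                syncs++;
            }
        }
        return syncs;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("trickle", ".db");
        try {
            int syncs = writeWithTrickleSync(tmp, new byte[1 << 20], 64 * 1024, 256 * 1024);
            System.out.println("syncs=" + syncs);
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

The point of the interval is only to bound how much dirty data can sit in the 
page cache at once; whether that actually protects hot pages still depends on 
the kernel's reclaim behaviour.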

Either way, in the scenario where we are compacting hot data (probably the 
more likely one, since the amount of compaction performed on data should 
decline with age, so we'll mostly be compacting younger data), the current 
behaviour is the worst possible, with the apparently still-going-strong 2.6 
(and possibly later) kernels definitely trashing the hot cache. So unless we 
detect the kernel version and set the default based on the known better 
behaviour of DONTNEED, this seems to me to be the better default. But we could 
perhaps also change the defaults for trickle fsync (say, set it to true with a 
100MB interval by default) so that the OS has plenty of opportunity to reclaim 
the pages we're writing if it needs to.
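
For the kernel-detection option, a rough sketch of what the check might look 
like is below. It only parses the JVM's os.version property and reports 
whether the kernel is 2.6 or older; the class and method names are 
hypothetical, and treating 2.6 as the cutoff is an assumption taken from the 
observation above, not a verified boundary.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KernelVersionSketch {
    // Leading "major.minor" of strings such as "2.6.32-431.el6.x86_64".
    private static final Pattern VERSION = Pattern.compile("^(\\d{1,4})\\.(\\d{1,4})");

    /** Returns {major, minor}, or null if the string does not start with a version. */
    static int[] parseMajorMinor(String osVersion) {
        if (osVersion == null) {
            return null;
        }
        Matcher m = VERSION.matcher(osVersion);
        if (!m.find()) {
            return null;
        }
        return new int[] { Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)) };
    }

    /** True when the version parses and is 2.6 or older (assumed cutoff). */
    static boolean isKernel26OrOlder(String osVersion) {
        int[] v = parseMajorMinor(osVersion);
        if (v == null) {
            return false; // unknown version: make no claim
        }
        return v[0] < 2 || (v[0] == 2 && v[1] <= 6);
    }

    public static void main(String[] args) {
        String osName = System.getProperty("os.name", "");
        String osVersion = System.getProperty("os.version", "");
        boolean linux = osName.toLowerCase().contains("linux");
        System.out.println("linux=" + linux
                + " version=" + osVersion
                + " kernel26OrOlder=" + (linux && isKernel26OrOlder(osVersion)));
    }
}
```

Which default to choose on each side of the cutoff is exactly the open 
question here, so the sketch deliberately stops at detection.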

> Reads have a slow ramp up in speed
> ----------------------------------
>
>                 Key: CASSANDRA-6746
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6746
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Ryan McGuire
>            Assignee: Benedict
>              Labels: performance
>             Fix For: 2.1 beta2
>
>         Attachments: 2.1_vs_2.0_read.png, 6746-patched.png, 6746.txt, 
> cassandra-2.0-bdplab-trial-fincore.tar.bz2, 
> cassandra-2.1-bdplab-trial-fincore.tar.bz2
>
>
> On a physical four node cluster I am doing a big write and then a big read. 
> The read takes a long time to ramp up to respectable speeds.
> !2.1_vs_2.0_read.png!
> [See data 
> here|http://ryanmcguire.info/ds/graph/graph.html?stats=stats.2.1_vs_2.0_vs_1.2.retry1.json&metric=interval_op_rate&operation=stress-read&smoothing=1]



--
This message was sent by Atlassian JIRA
(v6.2#6252)
