[ https://issues.apache.org/jira/browse/CASSANDRA-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12978435#action_12978435 ]

Peter Schuller commented on CASSANDRA-1902:
-------------------------------------------

Shouldn't this be a matter of *not* issuing DONTNEED for the relevant ranges 
during the write, rather than using WILLNEED? WILLNEED seems to trigger 
read-ahead and is presumably meant for cases where you are reading data that is 
expected not to be in cache and want to inform the kernel of the desire for 
read-ahead.

With respect to contiguous ranges: if possible, it would be nice if the desired 
degree of contiguity were a configuration option. For workloads where disk I/O 
is critical, I would not be surprised if even the worst-case overhead of very 
frequent posix_fadvise() calls is worth it, since the cost of a cache miss is 
extremely high relative to the cost of a syscall.

It seems (looking again at 
http://lxr.free-electrons.com/source/mm/fadvise.c#L118) that posix_fadvise() 
overhead should be the usual syscall overhead plus O(n) in the number of pages 
affected.

As to issuing advice on contiguous ranges, I'm not sure what the best course of 
action is. For a sparsely cached sstable there is no issue. For an sstable that 
is mostly in core, however, the DONTNEED ranges are often going to be so small 
that any thresholding just turns DONTNEED off almost completely.

I suppose there is some middle-ground somewhere where being smart about 
fadvise() may be advisable.

For simplicity's sake, how about a simple threshold: issue posix_fadvise() in 
such a way that no call applies to fewer than N pages of data. The 
implementation could be as simple as skipping (i.e., not issuing) DONTNEED 
calls for ranges that do not meet the criterion. The expected result, from an 
operational perspective, is:

* For very hot tables, hotness remains reasonable because, in effect, no 
DONTNEED is done. You're less efficient, of course, but no worse than currently.

* For less hot tables, you start seeing an effect, with the effect being 
largest on large sparsely cached tables (which is also where it is most 
important).

* If you are in a position where you really want to squeeze that last bit of 
hotness out of compaction even if it costs lots of syscalls, you can set the 
threshold to 0.

I am operating on the assumption that the only motivation for preferring 
contiguous ranges is the performance penalty of the fadvise() calls themselves.


> Migrate cached pages during compaction 
> ---------------------------------------
>
>                 Key: CASSANDRA-1902
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1902
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.7.1
>            Reporter: T Jake Luciani
>            Assignee: T Jake Luciani
>             Fix For: 0.7.1
>
>   Original Estimate: 32h
>  Remaining Estimate: 32h
>
> Post CASSANDRA-1470 there is an opportunity to migrate cached pages from a 
> pre-compacted CF during the compaction process.  
> First, add a method to MmappedSegmentFile: long[] pagesInPageCache() that 
> uses the posix mincore() function to detect the offsets of pages for this 
> file currently in page cache.
> Then add getActiveKeys() which uses underlying pagesInPageCache() to get the 
> keys actually in the page cache.
> use getActiveKeys() to detect which SSTables being compacted are in the os 
> cache and make sure the subsequent pages in the new compacted SSTable are 
> kept in the page cache for these keys. This will minimize the impact of 
> compacting a "hot" SSTable.
> A simpler yet similar approach is described here: 
> http://insights.oetiker.ch/linux/fadvise/

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
