[
https://issues.apache.org/jira/browse/CASSANDRA-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12978435#action_12978435
]
Peter Schuller commented on CASSANDRA-1902:
-------------------------------------------
Should it not be a matter of *not* DONTNEED:ing the relevant ranges during
write, rather than using WILLNEED? WILLNEED seems to result in read-ahead and
is presumably meant for cases where you are reading data that is expected not
to be in cache and you want to inform the kernel of the desire for read-ahead.
With respect to contiguous ranges: if possible, it would be nice if the desired
degree of contiguity were a configuration option. For workloads where disk I/O
is very critical, I would not be surprised if even the worst-case overhead of
very frequent posix_fadvise() calls turned out to be worth it, since the cost
of cache misses is so extremely high relative to the cost of a syscall.
It seems (looking again at
http://lxr.free-electrons.com/source/mm/fadvise.c#L118) that posix_fadvise()
overhead should be the usual syscall overhead plus O(n) in the number of pages.
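The "not DONTNEED:ing the relevant ranges" idea can be sketched minimally as
follows (a Linux-oriented Python illustration using the stdlib
os.posix_fadvise wrapper; this is not Cassandra code, just a demonstration of
dropping one byte range from the page cache while sparing another):

```python
import os
import tempfile

# Write a file, then advise the kernel that part of it is no longer needed.
# POSIX_FADV_DONTNEED walks the given page range and drops clean pages from
# the page cache -- the O(n)-in-pages cost discussed above.
fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"x" * (1 << 20))   # 1 MiB of data
    os.fsync(fd)                     # DONTNEED only drops *clean* pages
    # Drop the first 512 KiB from cache; the remaining range is simply not
    # DONTNEED:ed, so any cached pages there stay hot.
    os.posix_fadvise(fd, 0, 512 * 1024, os.POSIX_FADV_DONTNEED)
finally:
    os.close(fd)
    os.unlink(path)
```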
As to accomplishing advice on contiguous ranges, I'm not sure what the best
course of action is. On a sparsely in-core sstable there is no issue. On a
mostly in-core sstable, the DONTNEED ranges are probably very often going to
be so small that any thresholding just turns the advice off almost completely.
I suppose there is some middle-ground somewhere where being smart about
fadvise() may be advisable.
For simplicity's sake, how about having a simple threshold: "posix_fadvise()
in such a way that no call applies to fewer than N pages of data". The
implementation could be as simple as skipping ranges that do not meet the
criterion ("skip" meaning "don't avoid the DONTNEED"; a double negation,
admittedly). The expected result I see from an operational perspective is:
* For very hot tables, hotness remains reasonable because in effect, no
DONTNEED is done. You're less efficient of course, but no worse than currently.
* For less hot tables, you start seeing an effect, with the effect being
largest on large sparsely cached tables (which is also where it is most
important).
* If you are in a position where you really want to squeeze that last bit of
hotness out of compaction even if it costs lots of syscalls, you can set the
threshold to 0.
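The thresholding described above could be sketched roughly like this (a
hypothetical helper for illustration only, not Cassandra code; the 4 KiB page
size is an assumption):

```python
PAGE = 4096  # assumed page size

def spare_from_dontneed(ranges, min_pages):
    """Given (offset, length) byte ranges that are candidates for being
    spared from DONTNEED (i.e. they are currently hot in the page cache),
    keep only those large enough to be worth an fadvise() call of their own.
    Sub-threshold ranges are skipped, meaning their pages get DONTNEED:ed
    along with everything else. min_pages=0 disables the threshold, i.e.
    every hot range is spared regardless of syscall cost.
    """
    return [(off, length) for (off, length) in ranges
            if length >= min_pages * PAGE]
```

For example, with a threshold of 4 pages, a 4 KiB hot range is dropped while
a 64 KiB hot range is preserved: `spare_from_dontneed([(0, 4096),
(8192, 65536)], 4)` returns `[(8192, 65536)]`.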
I am operating on the assumption that the only motivation for preferring
contiguous ranges is the performance penalty of the fadvise().
> Migrate cached pages during compaction
> ---------------------------------------
>
> Key: CASSANDRA-1902
> URL: https://issues.apache.org/jira/browse/CASSANDRA-1902
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Affects Versions: 0.7.1
> Reporter: T Jake Luciani
> Assignee: T Jake Luciani
> Fix For: 0.7.1
>
> Original Estimate: 32h
> Remaining Estimate: 32h
>
> Post CASSANDRA-1470 there is an opportunity to migrate cached pages from a
> pre-compacted CF during the compaction process.
> First, add a method to MmappedSegmentFile: long[] pagesInPageCache() that
> uses the posix mincore() function to detect the offsets of pages for this
> file currently in page cache.
> Then add getActiveKeys() which uses underlying pagesInPageCache() to get the
> keys actually in the page cache.
> Use getActiveKeys() to detect which SSTables being compacted are in the OS
> cache, and make sure the corresponding pages in the newly compacted SSTable
> are kept in the page cache for these keys. This will minimize the impact of
> compacting a "hot" SSTable.
> A simpler yet similar approach is described here:
> http://insights.oetiker.ch/linux/fadvise/
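The mincore()-based pagesInPageCache() step described in the issue could be
sketched roughly as follows (a Linux-oriented Python illustration using
ctypes; `pages_in_page_cache` is a hypothetical stand-in for the proposed
MmappedSegmentedFile method, not actual Cassandra code):

```python
import ctypes
import ctypes.util
import mmap
import os

libc = ctypes.CDLL(ctypes.util.find_library("c") or None, use_errno=True)
libc.mincore.argtypes = [ctypes.c_void_p, ctypes.c_size_t,
                         ctypes.POINTER(ctypes.c_ubyte)]
PAGE = mmap.PAGESIZE

def pages_in_page_cache(path):
    """Sketch of the proposed pagesInPageCache(): mmap the file and ask
    mincore(2) which of its pages are resident in the page cache.
    Returns the byte offset of each resident page."""
    size = os.path.getsize(path)
    npages = (size + PAGE - 1) // PAGE
    vec = (ctypes.c_ubyte * npages)()          # one residency byte per page
    with open(path, "r+b") as f:
        mm = mmap.mmap(f.fileno(), size)       # writable shared mapping
        buf = (ctypes.c_char * size).from_buffer(mm)
        rc = libc.mincore(ctypes.addressof(buf), size, vec)
        err = ctypes.get_errno()
        del buf                                # release the exported pointer
        mm.close()
    if rc != 0:
        raise OSError(err, "mincore() failed")
    # The low bit of each vector entry is set if the page is resident.
    return [i * PAGE for i in range(npages) if vec[i] & 1]
```

Offsets returned this way could then be mapped back to keys, along the lines
of the getActiveKeys() proposal above.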
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.