[
https://issues.apache.org/jira/browse/CASSANDRA-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906363#action_12906363
]
Peter Schuller commented on CASSANDRA-1470:
-------------------------------------------
Sure. My anecdotal experience was with a use-case where the goal was to write a
lot of data to disk (effectively copy a large file) without evicting other data
from the page cache (which was actively being relied upon by a service on the
same host). The goal was to only take a one-time hit on cache misses when
switching to the new data, rather than having the process of distributing the
data severely affect performance.
To do this, DONTNEED alone was not enough; one also had to sync the relevant
data before the call. At the time I never looked at the kernel implementation,
but based on the cross-reference in the post you linked, DONTNEED results in a
call to invalidate_mapping_pages:
http://lxr.free-electrons.com/source/mm/fadvise.c#L118
Which is defined in truncate.c:
http://lxr.free-electrons.com/source/mm/truncate.c#L309
As the documentation of that function notes, it won't invalidate dirty pages
(among other kinds).
I have never actually tested whether DONTNEED has an effect on reads (an
untested assumption on my part). I could experiment some more and report back if you
think a working posix_fadvise() solution would be preferable to direct I/O (I
don't have a problem with the direct I/O solution, personally).
The phrasing in the man page in terms of the intent of DONTNEED seems to match
this:
"POSIX_FADV_DONTNEED attempts to free cached pages associated with the
specified region. This is useful, for example, while streaming large files.
A program may periodically request the kernel to free cached data that has
already been used, so that more useful cached pages are not discarded
instead."
In terms of the documented API of posix_fadvise(), I see where you're coming
from, and my initial reaction is that yes, NOREUSE seems like the obvious
choice. But on further thought I'm not so sure. Imagine the kernel actually
did implement NOREUSE and the others by remembering the advice and adjusting
its behavior subsequent to the call. What would the expected behavior be with
respect to:
(1) different threads in the same process
(2) different file descriptors associated with the same file
The man page doesn't say much explicitly, but my informal suspicion would be
that any such implementation would tend to be global to a process and to a
file, independent of threads or file descriptors. That is pure speculation, but
I suppose my point is that we don't really know (also, I just recently looked
at a Python wrapper which even made the assumption that it wasn't per-fd).
If an implementation *were* to be like that, NOREUSE would actually be less
suitable than DONTNEED, since DONTNEED only temporarily evicts pages just
after they were read or written, while NOREUSE might potentially cause the
kernel to avoid retaining pages for the file for *all* accesses (including live
traffic), permanently (or at least during the compaction window, assuming one
changes the advice afterwards).
My assumption with posix_fadvise() and fsync()+DONTNEED has been that it is
only an attempt to improve characteristics, and it won't be perfect. In
particular, on two ends of the spectrum:
* For smaller data sets that mostly or completely fit in memory, where the
cache is being relied on for performance, a compaction using fsync()+DONTNEED
would not really help much: the entire data set is evicted from memory very
quickly anyway, and you end up with a performance impact roughly equal to what
you would expect strictly as a result of flipping the sstable switch, i.e.
switching over to "cold" sstables.
* For very large data sets the compaction process takes a long time, and the
data touched at any given "few minute" (choose some arbitrary time period)
interval is a very small subset of the total data set. Thus, assuming the
cluster is not depending on very long-term warm-up periods for performance, the
impact should be very limited by the mere fact that the continuous live traffic
constitutes a continual warm-up of whatever data is slowly (relative to the
total size) evicted incrementally from the page cache. The hit at the point of
switch-to-cold-sstable will still be taken, but until that happens the
long-running compaction should at least have a much more limited impact.
The nice thing about direct I/O, provided that other concerns (such as
alignment, which was mentioned on the mailing list) don't outweigh it, is that
its semantics with respect to the page cache seem more obvious. I would tend
to expect that a given OS+fs combination will either support direct I/O or
not, and that when it is supported it truly will not interact with the page
cache. The posix_fadvise() behavior, on the other hand, I would not be
surprised to see vary a lot across future kernel versions (or other OSes)...
> use direct io for compaction
> ----------------------------
>
> Key: CASSANDRA-1470
> URL: https://issues.apache.org/jira/browse/CASSANDRA-1470
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Reporter: Jonathan Ellis
> Fix For: 0.6.6
>
> Attachments: 1470-v2.txt, 1470.txt
>
>
> When compaction scans through a group of sstables, it forces the data being
> used for hot reads out of the os buffer cache, which can have a dramatic
> negative effect on performance.