[ https://issues.apache.org/jira/browse/CASSANDRA-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906363#action_12906363 ]

Peter Schuller commented on CASSANDRA-1470:
-------------------------------------------

Sure. My anecdotal experience was with a use-case where the goal was to write a 
lot of data to disk (effectively copy a large file) without evicting other data 
from the page cache (which was actively being relied upon by a service on the 
same host). The intent was to take only a one-time hit of cache misses when 
switching over to the new data, rather than have the process of distributing 
the data severely affect performance.

In order to do this, not only was DONTNEED required, it was also necessary to 
sync the relevant data before the call. At the time I never looked at the 
kernel implementation, but based on the cross-reference in the post you linked 
to, DONTNEED results in a call to invalidate_mapping_pages:

   http://lxr.free-electrons.com/source/mm/fadvise.c#L118

Which is defined in truncate.c:

   http://lxr.free-electrons.com/source/mm/truncate.c#L309

As noted in the documentation of that function, it won't invalidate dirty 
pages (among other things).
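
For concreteness, the pattern I had in mind is roughly the following (a 
from-memory sketch in C, not the code I actually ran; the chunk size is 
arbitrary):

    /*
     * Copy a large file while periodically flushing and dropping the pages
     * just written, so the copy does not push hot data out of the page
     * cache. The fdatasync() matters because, per truncate.c above,
     * DONTNEED will not invalidate dirty pages.
     */
    #define _XOPEN_SOURCE 600
    #include <fcntl.h>
    #include <unistd.h>

    #define DROP_CHUNK (8 * 1024 * 1024)   /* drop cache every 8 MB written */

    static int copy_without_caching(int in_fd, int out_fd)
    {
        char buf[64 * 1024];
        off_t synced = 0, written = 0;
        ssize_t n;

        while ((n = read(in_fd, buf, sizeof buf)) > 0) {
            if (write(out_fd, buf, (size_t)n) != n)
                return -1;
            written += n;

            if (written - synced >= DROP_CHUNK) {
                /* pages must be clean before DONTNEED will evict them */
                if (fdatasync(out_fd) != 0)
                    return -1;
                posix_fadvise(out_fd, synced, written - synced,
                              POSIX_FADV_DONTNEED);
                synced = written;
            }
        }
        return n < 0 ? -1 : 0;
    }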

I have never actually tested whether DONTNEED has an effect on reads (I made 
an untested assumption). I could experiment some more and report back if you 
think a working posix_fadvise() solution would be preferable to direct I/O (I 
don't have a problem with the direct I/O solution, personally).
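
If it helps, the experiment I would run is roughly this (an untested sketch; 
it just counts resident pages via mincore() before and after the advise call, 
against whatever throwaway file you point it at):

    /*
     * Read a file to warm the page cache, issue POSIX_FADV_DONTNEED, and
     * use mincore() on a read-only mapping to count how many of the file's
     * pages are still resident before and after.
     */
    #define _DEFAULT_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static size_t resident_pages(int fd, size_t len)
    {
        size_t psz = (size_t)sysconf(_SC_PAGESIZE);
        size_t pages = (len + psz - 1) / psz, resident = 0;
        unsigned char *vec = malloc(pages);
        void *map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);

        if (vec == NULL || map == MAP_FAILED || mincore(map, len, vec) != 0)
            exit(1);
        for (size_t i = 0; i < pages; i++)
            resident += vec[i] & 1;
        munmap(map, len);
        free(vec);
        return resident;
    }

    int main(int argc, char **argv)
    {
        char buf[64 * 1024];
        struct stat st;
        int fd;

        if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0 ||
            fstat(fd, &st) != 0)
            return 1;
        while (read(fd, buf, sizeof buf) > 0)   /* warm the page cache */
            ;
        printf("resident before DONTNEED: %zu\n",
               resident_pages(fd, st.st_size));

        posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
        printf("resident after  DONTNEED: %zu\n",
               resident_pages(fd, st.st_size));

        close(fd);
        return 0;
    }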

The man page's description of the intent of DONTNEED seems to match this:
   
       "POSIX_FADV_DONTNEED attempts to free cached pages associated with the 
specified region.  This is useful, for
        example, while streaming large files.  A program may periodically 
request the kernel to free  cached  data  that  has
        already been used, so that more useful cached pages are not discarded 
instead."

In terms of the documented API of posix_fadvise(), I see where you're coming 
from, and my initial reaction is that yes, NOREUSE seems like the obvious 
choice. But on further thought I'm not so sure. Imagine the kernel actually 
did implement NOREUSE (and the other hints) by remembering the advice and 
adjusting its behavior after the call. What is the expected behavior with 
respect to:

  (1) different threads in the same process
  (2) different file descriptors associated with the same file

The man page doesn't say much explicitly, but my informal suspicion would be 
that any such implementation would tend to be global to a process and to a 
file, independent of threads or file descriptors. That is pure speculation, but 
I suppose my point is that we don't really know (also, I just recently looked 
at a Python wrapper which even made the assumption that it wasn't per-fd).

If an implementation *were* to be like that, NOREUSE would actually be less 
suitable than DONTNEED, since DONTNEED would only temporarily evict pages just 
after they were read or written, while NOREUSE might potentially cause the 
kernel to avoid retaining pages for the file for *all* accesses (including 
live traffic), permanently (or at least during the compaction window, assuming 
one changes the advice afterwards).

My assumption with posix_fadvise() and fsync()+DONTNEED has been that it is 
only an attempt to improve characteristics, and it won't be perfect. In 
particular, on two ends of the spectrum:

* For smaller data sets that mostly or completely fit in memory, where that is 
being relied upon for performance, a compaction using fsync()+DONTNEED would 
not really help much, since the entire database is evicted from memory very 
quickly and you end up with a performance impact roughly equal to what you 
would expect anyway strictly as a result of flipping the sstable switch and 
moving over to "cold" sstables.

* For very large data sets the compaction process takes a long time, and the 
data touched at any given "few minute" (choose some arbitrary time period) 
interval is a very small subset of the total data set. Thus, assuming the 
cluster is not depending on very long-term warm-up periods for performance, the 
impact should be very limited by the mere fact that the continuous live traffic 
constitutes a continual warm-up of whatever data is slowly (relative to the 
total size) evicted incrementally from the page cache. The hit at the point of 
switch-to-cold-sstable will still be taken, but until that happens the 
long-running compaction should at least have a much more limited impact.

The nice thing about direct I/O, provided that other concerns (such as 
alignment, which was mentioned on the mailing list) don't outweigh it, is that 
the semantics with respect to interaction with the page cache seem more 
obvious. I would tend to expect that a given OS+fs combination will either 
support direct I/O or not, and that when it does, the I/O truly won't interact 
with the page cache. The posix_fadvise() behavior, on the other hand, I would 
not be surprised to see vary a lot across future kernel versions (or other 
OSes)...
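
For reference, a minimal sketch of what I mean by the direct I/O route on 
Linux (assuming O_DIRECT and its usual alignment requirements; this is an 
illustration, not the attached patch):

    /*
     * Write data while bypassing the page cache via O_DIRECT. The buffer,
     * offset and length typically have to be aligned, hence posix_memalign().
     * The 4096-byte alignment and the assumption that len is a multiple of
     * it are simplifications for the sake of the sketch.
     */
    #define _GNU_SOURCE            /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define ALIGN  4096
    #define BUFSZ  (1024 * 1024)   /* 1 MB, a multiple of ALIGN */

    static int write_direct(const char *path, const char *data, size_t len)
    {
        void *buf;
        ssize_t rc = 0;
        int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);

        if (fd < 0 || posix_memalign(&buf, ALIGN, BUFSZ) != 0)
            return -1;

        for (size_t off = 0; off < len && rc >= 0; off += BUFSZ) {
            size_t chunk = len - off < BUFSZ ? len - off : BUFSZ;
            memcpy(buf, data + off, chunk);
            rc = write(fd, buf, chunk);   /* bypasses the page cache */
        }

        free(buf);
        close(fd);
        return rc < 0 ? -1 : 0;
    }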


> use direct io for compaction
> ----------------------------
>
>                 Key: CASSANDRA-1470
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1470
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Jonathan Ellis
>             Fix For: 0.6.6
>
>         Attachments: 1470-v2.txt, 1470.txt
>
>
> When compaction scans through a group of sstables, it forces data being used 
> for hot reads out of the OS buffer cache, which can have a dramatic negative 
> effect on performance.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
