> Due to the need for doing data alignment in the application itself (you are
> bypassing all the OS magic here), there is really nothing portable about
> O_DIRECT. Just have a look at open(2) on linux:

[snip]

> So, just within Linux you got different mechanisms for this depending on
> kernel and fs in use and you need to figure out what to do yourself as the
> OS will not tell you that. Don't expect this alignment stuff to be more
> standardized across OSes than inside of Linux. Still find this portable?

The concept of direct I/O, yes. I don't have experience with how
portable it is in practice with respect to alignment, however, so maybe
those details are a problem. But things like under what circumstances
which flags to posix_fadvise() actually have the desired effect
don't feel very portable either.

One could look at how well direct I/O works across platforms in e.g.
PostgreSQL or something like that. But maybe you're right and
O_DIRECT is just not worth it.

> O_DIRECT also bypasses the cache completely, so you loose a lot of the I/O

That was the intent.

> scheduling and caching across multiple reads/writers in threaded apps and
> separated processes which the OS may offer.

This is specifically what I want to bypass. I want to bypass the
operating system's caching to (1) avoid trashing the cache and (2)
know that a rate limited write translates fairly well to underlying
storage. Rate limiting asynchronous writes will often be less than
ideal since the operating system will tend to, by design, defer
writes. This can of course be overcome with fsync(), which does not
even require native code; that is a big point in its favor. But if we
still need native code for posix_fadvise() (for reads), then that hit
is taken anyway.
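
To make the fsync() idea concrete, here is a minimal sketch in Python
(chosen for brevity; Java offers the same via FileChannel.force()). The
helper name, chunk size and rate parameter are made up for illustration
and are not taken from Cassandra.

```python
import os
import time


def rate_limited_write(path, data, bytes_per_sec, chunk_size=1 << 20):
    """Write data to path, syncing each chunk and pacing to bytes_per_sec.

    Hypothetical helper: calling fsync() after every chunk asks the kernel
    to push that chunk to storage now instead of deferring it, so the pacing
    applied here should be roughly what the device sees.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    written = 0
    start = time.monotonic()
    try:
        view = memoryview(data)
        while written < len(view):
            # os.write may write less than requested; advance by what it did.
            written += os.write(fd, view[written:written + chunk_size])
            os.fsync(fd)
            # Sleep if we are ahead of the target rate.
            delay = written / bytes_per_sec - (time.monotonic() - start)
            if delay > 0:
                time.sleep(delay)
    finally:
        os.close(fd)
    return written
```

Syncing per chunk trades some throughput for predictability; a larger
chunk_size means fewer fsync() calls but burstier I/O.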

But sure. Perhaps posix_fadvise() combined with periodic fsync() calls
on writes may be preferable to direct I/O (with fsync() being required
both for rate limiting, if one is to combat the deferred writes, and
for avoiding cache eviction given the way fadvise works in Linux atm).
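
A rough sketch of that combination on the write side, in Python.
os.posix_fadvise and os.POSIX_FADV_DONTNEED are the standard library
wrappers (Unix only); the helper name is hypothetical. The fsync() comes
first because, per the Linux posix_fadvise(2) man page, pages that have
not yet been written out are unaffected by DONTNEED.

```python
import os


def write_and_drop_cache(fd, data):
    """Write data at the current offset of fd, sync it, then advise the
    kernel that the written range is not needed in the page cache.

    Hypothetical helper. Returns (offset, length) of the written range.
    """
    offset = os.lseek(fd, 0, os.SEEK_CUR)
    view = memoryview(data)
    done = 0
    while done < len(view):
        done += os.write(fd, view[done:])
    # Flush first: DONTNEED is not expected to drop dirty pages.
    os.fsync(fd)
    if hasattr(os, "posix_fadvise"):  # not available on every platform
        os.posix_fadvise(fd, offset, done, os.POSIX_FADV_DONTNEED)
    return offset, done
```

The advice is only a hint; whether the pages actually leave the cache is
up to the kernel.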

> This can especially be a big
> loss when you got servers with loads of memory for large filesystem caches
> where you might find it hard to actually utilize the cache in the
> application.

The entire point is to bypass the cache during compaction. But this
does not (unless I'm mistaken about how Cassandra works) invalidate
already pre-existing caches at the Cassandra/JVM level. In addition,
for large data sets (large being significantly larger than RAM size),
the data pulled into cache as part of compaction is not going to be
useful anyway, as is. There are the special cases where all or most
data fit in RAM and having all compaction I/O go through the cache may
even be preferable; but in the general case, I really don't see the
advantage of having that I/O go through cache.
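
For the read side, a sketch of what a read-once scan could look like
with fadvise, in Python. The helper name and chunk size are hypothetical.
One caveat to keep in mind: DONTNEED applies to the file range, so it
would presumably also drop pages of that range that were cached for
other readers before the scan started.

```python
import os


def stream_without_caching(path, chunk_size=1 << 20):
    """Yield the file's contents chunk by chunk, advising the kernel after
    each chunk that the range just read need not stay in the page cache.

    Hypothetical helper for a bulk, read-once scan.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        can_advise = hasattr(os, "posix_fadvise")
        if can_advise:
            # Hint that access is sequential (offset 0, length 0 = whole file).
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        offset = 0
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            yield chunk
            if can_advise:
                os.posix_fadvise(fd, offset, len(chunk), os.POSIX_FADV_DONTNEED)
            offset += len(chunk)
    finally:
        os.close(fd)
```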

If you do have most or all data in RAM, then certainly having all that
data warm at all times is preferable to doing I/O on a cold buffer
cache against sstables. But on the other hand, any use of direct I/O
or fadvise() will be optional (presumably). And given a setup where
your performance is entirely dependent on most data being in RAM at
all times, you will already have issues with e.g. cold starts of
nodes.

In any case, I definitely consider there to be good reasons to not
rely only on operating system caching; compaction is one of these
reasons both with and without direct I/O or fadvise().

> O_DIRECT was made to solve HW performance limitation on servers 10+ years
> ago. It is far from an ideal solution today (but until stuff like fadvice is
> implemented properly, somewhat unavoidable)

I think there are pretty clear and obvious use-cases where the cache
eviction implied by large bulk streaming operations on large amounts
of data is not what you want (there are any number of practical
situations where this has been an issue for me, if nothing else). But
if I'm overlooking something that would mean that this optimization,
trying to avoid eviction, is useless with Cassandra please do explain
it to me :)

But I'll definitely buy that posix_fadvise() is probably a cleaner solution.

-- 
/ Peter Schuller
