> Due to the need for doing data alignment in the application itself (you are
> bypassing all the OS magic here), there is really nothing portable about
> O_DIRECT. Just have a look at open(2) on linux:
[snip]
> So, just within Linux you got different mechanisms for this depending on
> kernel and fs in use and you need to figure out what to do yourself as the
> OS will not tell you that. Don't expect this alignment stuff to be more
> standardized across OSes than inside of Linux. Still find this portable?

The concept of direct I/O, yes. I don't have experience of how portable it
is in practice with respect to alignment, however, so maybe those details
are a problem. But knowing under what circumstances which flags to
posix_fadvise() actually have the desired effect doesn't feel very
portable either.

One might have a look at how well direct I/O works across platforms in
e.g. PostgreSQL or something like that. But maybe you're right and
O_DIRECT is just not worth it.

> O_DIRECT also bypasses the cache completely, so you lose a lot of the I/O

That was the intent.

> scheduling and caching across multiple reads/writers in threaded apps and
> separated processes which the OS may offer.

This is specifically what I want to bypass. I want to bypass the operating
system's caching to (1) avoid trashing the cache and (2) know that a
rate-limited write translates fairly well to the underlying storage.

Rate limiting asynchronous writes will often be less than ideal, since the
operating system will tend to, by design, defer writes. This aspect can of
course be overcome with fsync(), however. And that does not even require
native code, which is a big point in its favor. But if we still need native
code for posix_fadvise() (for reads), then that hit is taken anyway.

But sure. Perhaps posix_fadvise() in combination with regular fsync()
calls on writes may be preferable to direct I/O (with fsync() being
required both for rate limiting purposes, if one is to combat the deferred
writes, and for avoiding cache eviction given the way fadvise works in
Linux at the moment).
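
To make the fsync() plus posix_fadvise() combination concrete, here is a
minimal sketch in Python, assuming Linux-like behavior. The helper names
(rate_limited_write, _flush_and_drop) and the parameters (bytes_per_sec,
sync_every) are made up for illustration and are not from Cassandra. The
idea is that fsync() forces deferred writes out, so the rate limit roughly
applies to the device, and POSIX_FADV_DONTNEED is only issued after the
fsync() since it is a hint that generally only helps for clean pages.

```python
import os
import time


def _flush_and_drop(fd, synced, written):
    """fsync, then hint that the just-written range need not stay cached."""
    os.fsync(fd)  # force the deferred writes out to storage
    if hasattr(os, "posix_fadvise"):  # not available on every platform
        # Advisory only: the kernel is free to ignore this hint.
        os.posix_fadvise(fd, synced, written - synced,
                         os.POSIX_FADV_DONTNEED)
    return written


def rate_limited_write(path, chunks, bytes_per_sec,
                       sync_every=8 * 1024 * 1024):
    """Write chunks to path at roughly bytes_per_sec (hypothetical helper)."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    written = 0  # total bytes handed to write()
    synced = 0   # bytes already fsync()ed and advised away
    start = time.monotonic()
    try:
        for chunk in chunks:
            view = memoryview(chunk)
            while view:  # os.write may write fewer bytes than asked
                n = os.write(fd, view)
                view = view[n:]
                written += n
            if written - synced >= sync_every:
                synced = _flush_and_drop(fd, synced, written)
            # Sleep whenever we are ahead of the byte budget.
            ahead = written / bytes_per_sec - (time.monotonic() - start)
            if ahead > 0:
                time.sleep(ahead)
        _flush_and_drop(fd, synced, written)
    finally:
        os.close(fd)
    return written
```

A smaller sync_every tracks the target rate more closely at the cost of
more fsync() calls; the 8 MB default is an arbitrary assumption, not a
tuned value.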
> This can especially be a big
> loss when you got servers with loads of memory for large filesystem caches
> where you might find it hard to actually utilize the cache in the
> application.

The entire point is to bypass the cache during compaction. But this does
not (unless I'm mistaken about how Cassandra works) invalidate
pre-existing caches at the Cassandra/JVM level. In addition, for large
data sets (large being significantly larger than RAM size), the data
pulled into cache as part of compaction is not going to be useful anyway,
as is.

There are special cases where all or most data fits in RAM and having all
compaction I/O go through the cache may even be preferable; but in the
general case, I really don't see the advantage of having that I/O go
through the cache. If you do have most or all data in RAM, then certainly
having all that data warm at all times is preferable to doing I/O on a
cold buffer cache against sstables. But on the other hand, any use of
direct I/O or fadvise() will be optional (presumably).

Given a setup whereby your performance is entirely dependent on most data
being in RAM at all times, you will already have issues with e.g. cold
starts of nodes.

In any case, I definitely consider there to be good reasons to not rely
only on operating system caching; compaction is one of these reasons, both
with and without direct I/O or fadvise().

> O_DIRECT was made to solve HW performance limitation on servers 10+ years
> ago. It is far from an ideal solution today (but until stuff like fadvise is
> implemented properly, somewhat unavoidable)

I think there are pretty clear and obvious use-cases where the cache
eviction implied by large bulk streaming operations on large amounts of
data is not what you want (there are any number of practical situations
where this has been an issue for me, if nothing else).
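
As an illustration of a bulk streaming read that tries not to evict other
cached data, here is a rough sketch in Python, again assuming Linux-like
posix_fadvise() behavior. The function name stream_without_caching and the
chunk_size default are hypothetical. One caveat with this approach:
POSIX_FADV_DONTNEED applies to the file range, so it may also drop pages of
that file that were already hot before the stream started.

```python
import os


def stream_without_caching(path, chunk_size=1024 * 1024):
    """Yield the file in chunks, advising the kernel to drop each chunk
    from the page cache once it has been consumed (hypothetical helper)."""
    fd = os.open(path, os.O_RDONLY)
    have_fadvise = hasattr(os, "posix_fadvise")  # absent on some platforms
    try:
        if have_fadvise:
            # Length 0 means "to the end of the file".
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        offset = 0
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:  # empty read signals end of file
                break
            yield chunk
            if have_fadvise:
                # Advisory only: the kernel may ignore the hint.
                os.posix_fadvise(fd, offset, len(chunk),
                                 os.POSIX_FADV_DONTNEED)
            offset += len(chunk)
    finally:
        os.close(fd)
```

Usage would be along the lines of `for chunk in stream_without_caching(p):`
in whatever loop consumes the data; whether the hints actually keep the
cache warm for other data depends on the kernel and filesystem in use.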

But if I'm overlooking something that would mean that this optimization,
trying to avoid eviction, is useless with Cassandra please do explain it
to me :)

But I'll definitely buy that posix_fadvise() is probably a cleaner
solution.

-- 
/ Peter Schuller