marsupialtail commented on code in PR #13640:
URL: https://github.com/apache/arrow/pull/13640#discussion_r924803445
##########
cpp/src/arrow/io/file.cc:
##########
@@ -378,6 +378,77 @@ Status FileOutputStream::Write(const void* data, int64_t length) {
int FileOutputStream::file_descriptor() const { return impl_->fd(); }
+// ----------------------------------------------------------------------
+// DirectFileOutputStream, change the Open, Write and Close methods from FileOutputStream
+// Uses DirectIO for writes. Will only write out things in 4096 byte blocks. Buffers leftover bytes
+// in an internal data structure, which will be padded to 4096 bytes and flushed upon call to close.
+
+class DirectFileOutputStream::DirectFileOutputStreamImpl : public OSFile {
+ public:
+ Status Open(const std::string& path, bool append) {
+ const bool truncate = !append;
+ return OpenWritable(path, truncate, append, true /* write_only */, true);
+ }
+ Status Open(int fd) { return OpenWritable(fd); }
+};
+
+DirectFileOutputStream::DirectFileOutputStream() {
+ uintptr_t mask = (uintptr_t)(4095);
+ uint8_t *mem = static_cast<uint8_t *>(malloc(4096 + 4095));
+ cached_data = reinterpret_cast<uint8_t *>(reinterpret_cast<uintptr_t>(mem+4095) & ~(mask));
Review Comment:
[directio.zip](https://github.com/apache/arrow/files/9143508/directio.zip)
OK. I updated it to avoid overflow. The naive posix_fadvise approach, where
you call it once before ALL the writes, still doesn't control the page
cache (fadvise.cpp).
However, I discovered that if you call posix_fadvise before *every* write, on
the *entire* region you intend to write, it does control the page cache,
although it is 50% slower than O_DIRECT (fadvise1.cpp). If you call
posix_fadvise before every write on only the next region you intend to write,
it still doesn't control the page cache.
Bottom line: posix_fadvise probably works if we manage to find the right
calling pattern, but it is a bit slower.
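For readers without the zip, here is a minimal sketch of the "advise on the entire region before every write" pattern described above. It is not the contents of fadvise1.cpp. The helper name `WriteWithFadvise`, the zero-filled payload, and the choice of `POSIX_FADV_DONTNEED` as the advice value are all assumptions made for illustration, and the sketch is Linux/POSIX only.

```cpp
#include <fcntl.h>
#include <unistd.h>

#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper, not part of Arrow. Writes `total_bytes` zero bytes to
// `path` in pieces of at most `chunk_bytes`, calling posix_fadvise on the
// whole intended range before each write. Returns bytes written, or -1 on error.
int64_t WriteWithFadvise(const std::string& path, int64_t total_bytes,
                         int64_t chunk_bytes) {
  if (total_bytes < 0 || chunk_bytes <= 0) return -1;

  int fd = ::open(path.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (fd < 0) return -1;

  std::vector<uint8_t> chunk(static_cast<size_t>(chunk_bytes), 0);
  int64_t written = 0;
  while (written < total_bytes) {
    // Advise on the entire range [0, total_bytes), not only the next chunk.
    // POSIX_FADV_DONTNEED is an assumed advice value. The call is only a
    // hint, so a non-zero return is ignored here rather than treated as fatal.
    (void)::posix_fadvise(fd, 0, static_cast<off_t>(total_bytes),
                          POSIX_FADV_DONTNEED);

    const int64_t want = std::min(chunk_bytes, total_bytes - written);
    const ssize_t n = ::write(fd, chunk.data(), static_cast<size_t>(want));
    if (n < 0) {
      ::close(fd);
      return -1;
    }
    written += n;  // A short write simply leads to another loop iteration.
  }

  if (::close(fd) != 0) return -1;
  return written;
}
```

Calling `WriteWithFadvise("/tmp/out.bin", 1 << 20, 4096)` would issue one advise call per 4096-byte write. Whether this actually bounds page cache usage depends on the kernel and filesystem, so it would need to be measured, as was done in the comment above.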