westonpace commented on code in PR #13640:
URL: https://github.com/apache/arrow/pull/13640#discussion_r924994670
##########
cpp/src/arrow/io/file.cc:
##########
@@ -378,6 +378,77 @@ Status FileOutputStream::Write(const void* data, int64_t length) {
int FileOutputStream::file_descriptor() const { return impl_->fd(); }
+// ----------------------------------------------------------------------
+// DirectFileOutputStream, change the Open, Write and Close methods from FileOutputStream
+// Uses DirectIO for writes. Will only write out things in 4096 byte blocks. Buffers leftover bytes
+// in an internal data structure, which will be padded to 4096 bytes and flushed upon call to close.
+
+class DirectFileOutputStream::DirectFileOutputStreamImpl : public OSFile {
+ public:
+ Status Open(const std::string& path, bool append) {
+ const bool truncate = !append;
+ return OpenWritable(path, truncate, append, true /* write_only */, true);
+ }
+ Status Open(int fd) { return OpenWritable(fd); }
+};
+
+DirectFileOutputStream::DirectFileOutputStream() {
+ uintptr_t mask = (uintptr_t)(4095);
+ uint8_t *mem = static_cast<uint8_t *>(malloc(4096 + 4095));
+ cached_data = reinterpret_cast<uint8_t *>( reinterpret_cast<uintptr_t>(mem+4095) & ~(mask));
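
The constructor above over-allocates by 4095 bytes and rounds the pointer up to the next 4096-byte boundary. As a minimal sketch of that same arithmetic in isolation (the names `AlignUp`, `AlignedBlock` and `AllocateAligned` are hypothetical and are not part of the PR or of Arrow), assuming the alignment is a power of two:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper: round `addr` up to the next multiple of `alignment`.
// `alignment` must be a non-zero power of two.
inline std::uintptr_t AlignUp(std::uintptr_t addr, std::uintptr_t alignment) {
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  const std::uintptr_t mask = alignment - 1;
  return (addr + mask) & ~mask;
}

// Hypothetical holder: `raw` is what malloc returned (and what must be passed
// to free); `data` is the aligned address inside that allocation.
struct AlignedBlock {
  void* raw = nullptr;
  std::uint8_t* data = nullptr;
};

// Allocate `size` usable bytes starting at an address aligned to `alignment`.
// Over-allocating by `alignment - 1` guarantees an aligned start exists.
inline AlignedBlock AllocateAligned(std::size_t size, std::size_t alignment) {
  AlignedBlock block;
  block.raw = std::malloc(size + alignment - 1);
  if (block.raw == nullptr) return block;  // caller checks `raw`
  block.data = reinterpret_cast<std::uint8_t*>(
      AlignUp(reinterpret_cast<std::uintptr_t>(block.raw), alignment));
  return block;
}
```

The sketch keeps the raw pointer next to the aligned one because only the raw pointer may be handed back to `free`.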
Review Comment:
> Also note that any use of synchronous writes means your main thread may block when writing

Fortunately, the dataset writer has already been written with blocking writes
in mind :wink:. Writes are issued on the I/O thread pool, and a backpressure
mechanism is in place. In fact, the inability of that backpressure mechanism
to prevent swapping/thrashing is exactly what motivated this JIRA in the first
place.
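
To make "backpressure" concrete, here is a rough sketch of the general idea: cap the number of bytes queued for writing and make the producer wait when the cap is reached. This is not the dataset writer's actual implementation; `WriteThrottle` and all of its members are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Hypothetical helper, not an Arrow class: limits how many bytes may be
// queued for writing but not yet written.
class WriteThrottle {
 public:
  explicit WriteThrottle(int64_t max_bytes_in_flight)
      : max_bytes_(max_bytes_in_flight) {
    assert(max_bytes_in_flight > 0);
  }

  // Producer side: call before handing `nbytes` to an I/O thread. Blocks while
  // the budget is exhausted. A request larger than the whole budget is
  // admitted once nothing else is in flight, so it cannot wait forever.
  void Acquire(int64_t nbytes) {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [&] {
      return in_flight_ == 0 || in_flight_ + nbytes <= max_bytes_;
    });
    in_flight_ += nbytes;
  }

  // Non-blocking variant: returns false instead of waiting.
  bool TryAcquire(int64_t nbytes) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (in_flight_ != 0 && in_flight_ + nbytes > max_bytes_) return false;
    in_flight_ += nbytes;
    return true;
  }

  // I/O thread side: call after the blocking write has returned.
  void Release(int64_t nbytes) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      in_flight_ -= nbytes;
    }
    cv_.notify_all();
  }

  int64_t in_flight() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return in_flight_;
  }

 private:
  mutable std::mutex mutex_;
  std::condition_variable cv_;
  const int64_t max_bytes_;
  int64_t in_flight_ = 0;
};
```

A limit like this only counts what the application has queued. With buffered I/O, a write can return while the data is still sitting in the OS page cache, so the throttle has no view of that memory.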

> My intuition is just that you may get glass jaws in real-world performance that you won't observe using a simple micro-benchmark on an otherwise local storage.

I'm not sure what you mean by this. Is this a concern regarding direct vs
fadvise or a concern regarding sync vs async?