felipecrv opened a new issue, #39967:
URL: https://github.com/apache/arrow/issues/39967

   ### Describe the enhancement requested
   
   Operating systems don't immediately commit data provided by userland code 
to storage devices. This is usually not a problem because (1) the kernel will 
not take very long to asynchronously commit the data on its own, (2) the kernel 
mediates access to the filesystem and guarantees that all processes see the 
writes performed to the same file so far [1], and (3) applications can handle 
missing data due to power loss or kernel crashes (e.g. a file used for caching 
that gets corrupted can easily be re-downloaded).
   
   Applications with more stringent durability requirements (e.g. SQLite) 
force the kernel to commit pending data by calling `fsync` on transaction 
commit. But even databases avoid doing this for every file and opt to `fsync` 
only a special file storing the [Write-Ahead 
Log](https://en.wikipedia.org/wiki/Write-ahead_logging) [2] containing batched 
updates. Don't think of `Sync()` as a flushing mechanism that you should always 
call: `fsync` can add a lot of unnecessary latency (potentially many hundreds 
of milliseconds) and wear down storage devices.
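
To make the distinction between writing and committing concrete, here is a 
minimal POSIX sketch. `WriteDurably` is a hypothetical helper invented for 
this illustration, not an Arrow API: `write` only hands the data to the 
kernel, and `fsync` is the call that blocks until the kernel reports the data 
as committed.

```cpp
#include <fcntl.h>
#include <unistd.h>

#include <cerrno>
#include <cstddef>
#include <string>

// Hypothetical helper (not an Arrow API): write `data` to `path` and block
// until the kernel reports it as committed. Returns 0 on success or an errno
// value on failure.
int WriteDurably(const std::string& path, const std::string& data) {
  int fd = ::open(path.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (fd < 0) return errno;

  const char* p = data.data();
  std::size_t remaining = data.size();
  while (remaining > 0) {
    ssize_t n = ::write(fd, p, remaining);
    if (n < 0) {
      if (errno == EINTR) continue;  // interrupted by a signal, retry
      int err = errno;
      ::close(fd);
      return err;
    }
    p += n;
    remaining -= static_cast<std::size_t>(n);
  }

  // Up to here the data may live only in kernel buffers. fsync() asks the
  // kernel to flush it to the storage device and waits for that to finish.
  if (::fsync(fd) != 0) {
    int err = errno;
    ::close(fd);
    return err;
  }
  return ::close(fd) == 0 ? 0 : errno;
}
```

Dropping the `fsync` call leaves the function correct for readers on the same 
machine; only the guarantee across power loss or a kernel crash changes.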
   
   ## Network and Distributed File Systems
   
   Networked file systems usually provide commands that ensure durability of 
the pending writes on the server storage media. `Sync` would delegate to these 
commands in these cases.
   
   Distributed filesystems that rely on data replication might provide 
operations to ensure writes are propagated before returning ([quorum 
writes](https://en.wikipedia.org/wiki/Quorum_(distributed_computing))). Since 
late 2020 this is not an issue with AWS S3 [3], which now provides strong 
read-after-write consistency, so `Sync` on S3 files would be a no-op.
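
One way to picture how `Sync` could vary per filesystem is the sketch below. 
All class names are hypothetical and chosen for this illustration; they are 
not Arrow's actual classes, and a `bool` stands in for whatever status type a 
real implementation would return.

```cpp
#include <unistd.h>

// Hypothetical interface, for illustration only.
class SyncableFile {
 public:
  virtual ~SyncableFile() = default;
  // Returns true once previously written data is considered durable.
  virtual bool Sync() = 0;
};

// Local file: delegate to the operating system's fsync().
class LocalFile : public SyncableFile {
 public:
  explicit LocalFile(int fd) : fd_(fd) {}
  bool Sync() override { return ::fsync(fd_) == 0; }

 private:
  int fd_;  // file descriptor, not owned by this class
};

// Object store that needs no extra step after a write, as described above
// for S3: Sync has nothing left to do.
class ObjectStoreFile : public SyncableFile {
 public:
  bool Sync() override { return true; }
};
```

Callers can then invoke `Sync` through the base interface without knowing 
whether it is an expensive flush or a no-op.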
   
   ## Masking Latency
   
   If you must issue `Sync` calls, one way to mask the latency caused by them 
is to issue writes as soon as possible, do some other work, and only then call 
`Sync`.
   
   ```cpp
   file.Write(data);
   file.Write(more_data);
   FunctionThatTakesSomeTimeAndDoesUsefulWork();
   // It's very likely the kernel has no pending data on file at this point,
   // so `Sync` will return quickly.
   file.Sync();
   ```
   
   [1] exceptions to this exist with the use of flags like `O_DIRECT` on 
Linux's `open` syscall https://www.man7.org/linux/man-pages/man2/open.2.html
   [2] then in the event of a power loss, the database can replay the 
write-ahead log and complete any missing write to the more complex structures 
of the database like indexes
   [3] 
https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-read-after-write-consistency/
   
   ### Component(s)
   
   C++

