[ https://issues.apache.org/jira/browse/KAFKA-15312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
José Armando García Sancio resolved KAFKA-15312.
------------------------------------------------
Resolution: Fixed
> FileRawSnapshotWriter must flush before atomic move
> ---------------------------------------------------
>
> Key: KAFKA-15312
> URL: https://issues.apache.org/jira/browse/KAFKA-15312
> Project: Kafka
> Issue Type: Bug
> Components: kraft
> Reporter: José Armando García Sancio
> Assignee: José Armando García Sancio
> Priority: Major
> Fix For: 3.3.3, 3.6.0, 3.4.2, 3.5.2
>
>
> On ext4 file systems it is possible for KRaft to create zero-length snapshot
> files. Not all file systems fsync to disk on close. For KRaft to guarantee
> that the data has made it to disk before calling rename, it needs to make
> sure that the file has been fsynced.
> We have seen cases where snapshot files on ext4 file systems end up with
> zero length.
> {quote} "Delayed allocation" means that the filesystem tries to delay the
> allocation of physical disk blocks for written data for as long as possible.
> This policy brings some important performance benefits. Many files are
> short-lived; delayed allocation can keep the system from writing fleeting
> temporary files to disk at all. And, for longer-lived files, delayed
> allocation allows the kernel to accumulate more data and to allocate the
> blocks for data contiguously, speeding up both the write and any subsequent
> reads of that data. It's an important optimization which is found in most
> contemporary filesystems.
> But, if blocks have not been allocated for a file, there is no need to write
> them quickly as a security measure. Since the blocks do not yet exist, it is
> not possible to read somebody else's data from them. So ext4 will not
> (cannot) write out unallocated blocks as part of the next journal commit
> cycle. Those blocks will, instead, wait until the kernel decides to flush
> them out; at that point, physical blocks will be allocated on disk and the
> data will be made persistent. The kernel doesn't like to let file data sit
> unwritten for too long, but it can still take a minute or so (with the
> default settings) for that data to be flushed - far longer than the five
> seconds normally seen with ext3. And that is why a crash can cause the loss
> of quite a bit more data when ext4 is being used.
> {quote}
> from: [https://lwn.net/Articles/322823/]
> {quote}auto_da_alloc ( * ), noauto_da_alloc
> Many broken applications don't use fsync() when replacing existing files via
> patterns such as fd = open("foo.new")/write(fd,..)/close(fd)/
> rename("foo.new", "foo"), or worse yet, fd = open("foo",
> O_TRUNC)/write(fd,..)/close(fd). If auto_da_alloc is enabled, ext4 will
> detect the replace-via-rename and replace-via-truncate patterns and force
> that any delayed allocation blocks are allocated such that at the next
> journal commit, in the default data=ordered mode, the data blocks of the new
> file are forced to disk before the rename() operation is committed. This
> provides roughly the same level of guarantees as ext3, and avoids the
> "zero-length" problem that can happen when a system crashes before the
> delayed allocation blocks are forced to disk.
> {quote}
> from: [https://www.kernel.org/doc/html/latest/admin-guide/ext4.html]
>
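The fix calls for a specific ordering: write the snapshot to a temporary file, fsync it, and only then rename it into place. The sketch below shows that ordering using `FileChannel.force(true)` (Java's equivalent of fsync) and `Files.move(..., ATOMIC_MOVE)`. It is a minimal illustration, not Kafka's actual `FileRawSnapshotWriter` code. The class `AtomicSnapshotWrite`, the method `writeAtomically`, and the `.part` temporary-file suffix are hypothetical. The sketch also assumes the temporary file and the target are in the same directory, so the rename can be atomic.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

// Hypothetical helper illustrating "fsync before atomic move".
// This is NOT Kafka's FileRawSnapshotWriter implementation.
final class AtomicSnapshotWrite {

    private AtomicSnapshotWrite() {}

    // Writes data to a temporary sibling of target, forces it to stable
    // storage, and only then renames it to the final name.
    static void writeAtomically(Path target, ByteBuffer data) throws IOException {
        // Hypothetical temp-file name; the real writer uses its own naming scheme.
        Path tmp = target.resolveSibling(target.getFileName() + ".part");

        try (FileChannel channel = FileChannel.open(tmp,
                StandardOpenOption.CREATE,
                StandardOpenOption.TRUNCATE_EXISTING,
                StandardOpenOption.WRITE)) {
            // A single write() call may not drain the buffer, so loop.
            while (data.hasRemaining()) {
                channel.write(data);
            }
            // fsync the file contents (and metadata) BEFORE the rename.
            // close() alone does not guarantee this, and with ext4 delayed
            // allocation the renamed file could otherwise be zero-length
            // after a crash.
            channel.force(true);
        }

        // Publish the fully synced file under its final name in one step.
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("snapshots");
        Path target = dir.resolve("example.checkpoint");
        writeAtomically(target,
                ByteBuffer.wrap("snapshot-bytes".getBytes(StandardCharsets.UTF_8)));
        System.out.println("wrote " + Files.size(target) + " bytes to " + target);
    }
}
```

Calling `force(true)` before the move does not depend on ext4's `auto_da_alloc` heuristic. The data blocks should already be on stable storage before the rename is issued, so a crash after the rename is persisted should no longer expose an empty snapshot.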