José Armando García Sancio created KAFKA-15312:
--------------------------------------------------

             Summary: FileRawSnapshotWriter must flush before atomic move
                 Key: KAFKA-15312
                 URL: https://issues.apache.org/jira/browse/KAFKA-15312
             Project: Kafka
          Issue Type: Bug
          Components: kraft
            Reporter: José Armando García Sancio
            Assignee: José Armando García Sancio
             Fix For: 3.6.0


Not all file systems fsync to disk on close. For KRaft to guarantee that the 
data has made it to disk before calling rename, it needs to make sure that the 
file has been fsynced.
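
A minimal sketch of the write, fsync, then rename ordering described above. 
This is not the actual FileRawSnapshotWriter code; the class name 
DurableReplace, the method writeThenRename and its parameters are hypothetical 
and exist only for illustration. It assumes the snapshot is first written to a 
temporary file in the same directory as the final file.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of Kafka.
final class DurableReplace {
    static void writeThenRename(Path tmp, Path dest, byte[] data) throws IOException {
        try (FileChannel channel = FileChannel.open(
                tmp,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buffer = ByteBuffer.wrap(data);
            // A single write() call may be partial, so loop until the buffer is drained.
            while (buffer.hasRemaining()) {
                channel.write(buffer);
            }
            // Ask the OS to push file content and metadata to the storage device
            // before close(); close() alone does not promise this.
            channel.force(true);
        }
        // Rename only after the data has been forced, so the final name should
        // never refer to a file whose blocks are still waiting to be allocated.
        Files.move(tmp, dest, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

The call to force(true) before the move is the step this issue is about: 
without it, the rename can be committed before the delayed-allocation blocks 
are written.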

We have seen cases where the snapshot file has zero-length data on the ext4 
file system.
{quote} "Delayed allocation" means that the filesystem tries to delay the 
allocation of physical disk blocks for written data for as long as possible. 
This policy brings some important performance benefits. Many files are 
short-lived; delayed allocation can keep the system from writing fleeting 
temporary files to disk at all. And, for longer-lived files, delayed allocation 
allows the kernel to accumulate more data and to allocate the blocks for data 
contiguously, speeding up both the write and any subsequent reads of that data. 
It's an important optimization which is found in most contemporary filesystems.

But, if blocks have not been allocated for a file, there is no need to write 
them quickly as a security measure. Since the blocks do not yet exist, it is 
not possible to read somebody else's data from them. So ext4 will not (cannot) 
write out unallocated blocks as part of the next journal commit cycle. Those 
blocks will, instead, wait until the kernel decides to flush them out; at that 
point, physical blocks will be allocated on disk and the data will be made 
persistent. The kernel doesn't like to let file data sit unwritten for too 
long, but it can still take a minute or so (with the default settings) for that 
data to be flushed - far longer than the five seconds normally seen with ext3. 
And that is why a crash can cause the loss of quite a bit more data when ext4 
is being used. 
{quote}
from: [https://lwn.net/Articles/322823/]
{quote}auto_da_alloc(*), noauto_da_alloc

Many broken applications don't use fsync() when replacing existing files via 
patterns such as fd = open("foo.new")/write(fd,..)/close(fd)/ rename("foo.new", 
"foo"), or worse yet, fd = open("foo", O_TRUNC)/write(fd,..)/close(fd). If 
auto_da_alloc is enabled, ext4 will detect the replace-via-rename and 
replace-via-truncate patterns and force that any delayed allocation blocks are 
allocated such that at the next journal commit, in the default data=ordered 
mode, the data blocks of the new file are forced to disk before the rename() 
operation is committed. This provides roughly the same level of guarantees as 
ext3, and avoids the "zero-length" problem that can happen when a system 
crashes before the delayed allocation blocks are forced to disk.
{quote}
from: [https://www.kernel.org/doc/html/latest/admin-guide/ext4.html]


--
This message was sent by Atlassian Jira
(v8.20.10#820010)
