bug#51345: dd with conv=fsync sometimes returns when its writes are still cached
I found a bit of time to work on this and installed the attached patch,
which should address the issue. Thanks for reporting it.

From 3368b8745046aeaa89f418f560e714b374f1a560 Mon Sep 17 00:00:00 2001
From: Paul Eggert
Date: Fri, 28 Jan 2022 00:01:07 -0800
Subject: [PATCH] dd: synchronize output after write errors

Problem reported by Sworddragon (Bug#51345).
* src/dd.c (cleanup): Synchronize output unless dd has been interrupted.
(synchronize_output): New function, split out from dd_copy.
Update conversions_mask so synchronization is done at most once.
(main): Do not die with the output file open, since we want to be
able to synchronize it before exiting.
Synchronize output before exiting.
---
 NEWS               |  3 ++
 doc/coreutils.texi | 10 +++---
 src/dd.c           | 76 ++++++++++++++++++++++++++++++++--------------
 3 files changed, 64 insertions(+), 25 deletions(-)

diff --git a/NEWS b/NEWS
index 15c9428bd..757abee15 100644
--- a/NEWS
+++ b/NEWS
@@ -41,6 +41,9 @@ GNU coreutils NEWS                                    -*- outline -*-
   padding them with zeros to 9 digits.  It uses clock_getres and
   clock_gettime to infer the clock resolution.
 
+  dd conv=fsync now synchronizes output even after a write error,
+  and similarly for dd conv=fdatasync.
+
   timeout --foreground --kill-after=... will now exit with status 137
   if the kill signal was sent, which is consistent with the behavior
   when the --foreground option is not specified.  This allows users to

diff --git a/doc/coreutils.texi b/doc/coreutils.texi
index c17406550..088d1764c 100644
--- a/doc/coreutils.texi
+++ b/doc/coreutils.texi
@@ -9397,14 +9397,16 @@ Continue after read errors.
 @item fdatasync
 @opindex fdatasync
 @cindex synchronized data writes, before finishing
-Synchronize output data just before finishing.  This forces a physical
-write of output data.
+Synchronize output data just before finishing,
+even if there were write errors.
+This forces a physical write of output data.
 
 @item fsync
 @opindex fsync
 @cindex synchronized data and metadata writes, before finishing
-Synchronize output data and metadata just before finishing.  This
-forces a physical write of output data and metadata.
+Synchronize output data and metadata just before finishing,
+even if there were write errors.
+This forces a physical write of output data and metadata.
 
 @end table

diff --git a/src/dd.c b/src/dd.c
index 957ad129e..4ddc6db12 100644
--- a/src/dd.c
+++ b/src/dd.c
@@ -939,6 +939,8 @@ iclose (int fd)
   return 0;
 }
 
+static int synchronize_output (void);
+
 static void
 cleanup (void)
 {
@@ -948,6 +950,13 @@ cleanup (void)
   alignfree (obuf);
 #endif
 
+  if (!interrupt_signal)
+    {
+      int sync_status = synchronize_output ();
+      if (sync_status)
+        exit (sync_status);
+    }
+
   if (iclose (STDIN_FILENO) != 0)
     die (EXIT_FAILURE, errno, _("closing input file %s"),
          quoteaf (input_file));
@@ -2377,17 +2386,33 @@ dd_copy (void)
       && 0 <= reported_w_bytes && reported_w_bytes < w_bytes)
     print_xfer_stats (0);
 
-  if ((conversions_mask & C_FDATASYNC) && ifdatasync (STDOUT_FILENO) != 0)
+  return exit_status;
+}
+
+/* Synchronize output according to conversions_mask.
+   Do this even if w_bytes is zero, as fsync and fdatasync
+   flush out write requests from other processes too.
+   Clear bits in conversions_mask so that synchronization is done only once.
+   Return zero if successful, an exit status otherwise.  */
+
+static int
+synchronize_output (void)
+{
+  int exit_status = 0;
+  int mask = conversions_mask;
+  conversions_mask &= ~ (C_FDATASYNC | C_FSYNC);
+
+  if ((mask & C_FDATASYNC) && ifdatasync (STDOUT_FILENO) != 0)
     {
       if (errno != ENOSYS && errno != EINVAL)
         {
           error (0, errno, _("fdatasync failed for %s"),
                  quoteaf (output_file));
           exit_status = EXIT_FAILURE;
         }
-      conversions_mask |= C_FSYNC;
+      mask |= C_FSYNC;
     }
 
-  if ((conversions_mask & C_FSYNC) && ifsync (STDOUT_FILENO) != 0)
+  if ((mask & C_FSYNC) && ifsync (STDOUT_FILENO) != 0)
     {
       error (0, errno, _("fsync failed for %s"), quoteaf (output_file));
       return EXIT_FAILURE;
@@ -2460,6 +2485,16 @@ main (int argc, char **argv)
                  | (conversions_mask & C_EXCL ? O_EXCL : 0)
                  | (seek_records || (conversions_mask & C_NOTRUNC)
                     ? 0 : O_TRUNC));
 
+      off_t size;
+      if ((INT_MULTIPLY_WRAPV (seek_records, output_blocksize, &size)
+           || INT_ADD_WRAPV (seek_bytes, size, &size))
+          && !(conversions_mask & C_NOTRUNC))
+        die (EXIT_FAILURE, 0,
+             _("offset too large: "
+               "cannot truncate to a length of seek=%"PRIdMAX""
+               " (%td-byte) blocks"),
+             seek_records, output_blocksize);
+
       /* Open the output file w
bug#51345:
On 10/29/21 05:38, Pádraig Brady wrote:
> I'm leaning towards improving this by always doing an fsync on exit
> if we get a read or write error and have successfully written any
> data, so that previously written data is sync'd as requested.

Sounds like a good idea to me too.
bug#51345:
I got an email that somebody replied to this report, but it looks like they
sent it to the wrong mail address (maybe the messages will be merged in
order). I will still reply to it:

> Suggestion as a possible workaround: Have a look at random(4) and random(7),
> and ask yourself whether your use of /dev/random, rather than /dev/urandom,
> is really necessary for your application. If not, you might try /dev/urandom
> instead and report what you observe.
>
> As documented in those man pages, there are good reasons to avoid using
> /dev/random, not the least of which is that it blocks frequently (every time
> the entropy pool is exhausted), whereas /dev/urandom does not. That alone may
> explain the execution time inconsistencies you're reporting.

I'm aware of the differences between the two, and I don't see how the use of
/dev/random here could explain any of the issues:

- The drastically increased writing times occur after dd complains about no
  free space left (assuming this means dd has finished its explicit writing
  task, i.e. writing to the transparent underlying cache), and dd's times up
  to that point were fine, implying that either significant blocking never
  occurred or dd correctly handles blocking input files by asynchronously
  continuing to write to the output file to avoid unnecessary delays.
- That dd returns while the buffers are not yet flushed also can't be
  explained by the use of /dev/random, unless there is some very crazy
  out-of-mind bug somewhere.

But I still did some tests with /dev/urandom and conv=fsync, and I saw all 3
cases there too: normal writing times that are slightly longer (133.247s),
drastically increased writing times (235.906s), and early returning from dd
(56.4327s) while an immediately executed sync still blocked for over a
minute.

Also, for the slightly delayed writing times (~133s with conv=fsync compared
to ~129s with oflag=direct), I noticed that with conv=fsync the LED of the
USB thumb drive starts to blink a few seconds after dd starts showing status
progress, so I assume Knoppix's kernel/settings cause the cache to be
flushed slightly delayed, which seems more or less normal and would explain
this specific case of being ~4s slower.

And while I was writing this, the next message came in:

> Ah right. What's happening is dd is not doing the fsync()
> as it's exiting early due to write(2) getting ENOSPC.

But that would not make much sense, for 2 reasons:

1. The error message that the space ran out is the expected behavior here,
   and there is no rationale for dd to exit then instead of still calling
   fsync().
2. In most attempts, after dd threw the error message about no free space,
   it still blocked for at least over a minute, and an immediately executed
   sync always returned. So it looks like dd sometimes does call fsync() on
   ENOSPC and sometimes it doesn't. And why a successful call to fsync() is
   then sometimes ~2.5× slower is also a mystery.

> So this is a gotcha that should at least be documented.
> Though I'm leaning towards improving this by always
> doing an fsync on exit if we get a read or write error
> and have successfully written any data, so that
> previously written data is sync'd as requested.

Yes, ensuring fsync() is called seems to be the better option here. But this
still leaves the question from above: why does dd seemingly do this
sometimes already on ENOSPC? Maybe it is a race between ENOSPC and fsync()
in dd causing this bug? Do the sometimes-occurring very delayed writes
(e.g. ~235s instead of ~133s) eventually play a role here?
bug#51345:
On 28/10/2021 22:59, Sworddragon wrote:
> Despite I'm not using Linux as my main system anymore and wanted to
> avoid getting into too much work, I found some time to do some tests,
> as this issue bugs me just too much.
>
>> You could try running the following immediately after, to see if it
>> also returns quickly:
>>
>>   blockdev --flushbufs /dev/sdb
>
> Yes, this command also blocks for a bit over 1 minute when this issue
> occurs.

Right, that suggests the conv=fsync to dd was ineffective.

> Here is the output (I had to freely translate the strings since this
> Knoppix instance is only in German, so they might be slightly
> inaccurate; also I had to type all the text since it was executed on a
> different system, but I carefully checked to not introduce any typos):
>
>   root@Microknoppix:~# dd if=/dev/random of=/dev/sdb bs=1M conv=fsync status=progress
>   1039138816 Bytes (1,0 GB, 991 MiB) copied, 56 s, 18,5 MB/s
>   dd: Error on writing '/dev/sdb': The device has not enough free space
>   999+0 records in
>   998+0 records out

Ah right. What's probably happening is that dd is not doing the fsync()
because it's exiting early due to write(2) getting ENOSPC.

As you've seen, you can avoid the need for fsync() to flush buffers with
oflag=direct. A reason that might also be faster is that, depending on
your free memory, you would avoid churning the kernel caches.

Another way to at least ensure the conv=fsync was effective would be to
not write too much. I.e. you could determine the exact size of the disk
(with `blockdev --getsize64 /dev/sdb`, for example) and then use an
appropriate bs= and count=. That's awkward though, and difficult to do
with good performance, due to larger block sizes not generally aligning
with the device size.

So this is a gotcha that should at least be documented. Though I'm
leaning towards improving this by always doing an fsync on exit if we
get a read or write error and have successfully written any data, so
that previously written data is sync'd as requested.

cheers,
Pádraig
bug#51345:
Despite I'm not using Linux as my main system anymore and wanted to avoid
getting into too much work, I found some time to do some tests, as this
issue bugs me just too much.

> You could try running the following immediately after,
> to see if it also returns quickly:
>
>   blockdev --flushbufs /dev/sdb

Yes, this command also blocks for a bit over 1 minute when this issue
occurs. Here is the output (I had to freely translate the strings since
this Knoppix instance is only in German, so they might be slightly
inaccurate; also I had to type all the text since it was executed on a
different system, but I carefully checked to not introduce any typos):

  root@Microknoppix:~# dd if=/dev/random of=/dev/sdb bs=1M conv=fsync status=progress
  1039138816 Bytes (1,0 GB, 991 MiB) copied, 56 s, 18,5 MB/s
  dd: Error on writing '/dev/sdb': The device has not enough free space
  999+0 records in
  998+0 records out
  1047526912 Bytes (1,0 GB, 999 MiB) copied, 57,0283 s, 18,4 MB/s
  root@Microknoppix:~# time blockdev --flushbufs /dev/sdb

  real    1m9,747s
  user    0m0,001s
  sys     0m0,005s

(A bit offtopic, but maybe another minor bug: the first line of the copied
bytes differs slightly with every attempt (like 991 MiB, 994 MiB, 997 MiB,
etc.) but the last line seems to be always the same, down to the last byte.
I had the impression the output from status=progress does not do a final
update once dd throws the writing error because the free space ran out; is
my assumption correct? If so, it would probably make sense, and be helpful
for others in debugging attempts, if dd did a final update. But back to the
original topic.)

> Also with rates status.
>
>   dd if=/dev/random of=/dev/sdb bs=1M oflag=direct status=progress

With conv=fsync the bug usually occurs every 3-4 attempts (but the sample
size is small), so I decided to do 20 attempts with oflag=direct. The issue
did not appear once, every attempt took roughly 129 seconds (+/- 0.5
seconds), and a sync executed afterwards always returned immediately.

Does direct I/O signal the related storage device to not use its built-in
cache? If not, this implies that the USB thumb drive's firmware is very
likely not faulty and does not contribute to this issue.

But I noticed something else (I also noticed it in the past, but it got my
attention now after all the stable oflag=direct writes): with conv=fsync the
writing time is sometimes around twice as high as it should be. On doing 3
attempts with it, dd showed me final times of 244,382s, 239,127s and
135,389s, while the status progress always stopped after 56s. So, waiting
for the buffers to be flushed after dd complained about no free storage
space, the times differ very significantly: at best roughly as fast as
oflag=direct, and at worst ~2.5 times slower than they should be.

To sum this up, we have 2 issues:

1. The status progress from status=progress does not make a final update
   (which would be useful) - but that is offtopic and probably easy to fix,
   so I guess there is no more for me to do here.
2. fsync often either returns early, while the buffers are not fully
   flushed yet, or it drastically increases the writing time compared to
   normal attempts (even the normal attempts are apparently quite a bit
   slower than direct I/O, while probably the opposite should be the case),
   which is very likely either a bug in dd, or something in the Linux
   kernel's fsync() implementation is very broken.
bug#51345: dd with conv=fsync sometimes returns when its writes are still cached
Sworddragon wrote:
> On Knoppix 9.1 with the Linux Kernel 5.10.10-64 x86_64 and GNU Coreutils
> 8.32 I wanted to overwrite my USB Thumb Drive a few times with random
> data via "dd if=/dev/random of=/dev/sdb bs=1M conv=fsync". While it
> usually takes ~2+ minutes to perform this action dd returned once after
> less than 60 seconds which made me a bit curious.

I suggest another try using oflag=direct instead of conv=fsync.

  dd if=/dev/random of=/dev/sdb bs=1M oflag=direct

Also with rates status.

  dd if=/dev/random of=/dev/sdb bs=1M oflag=direct status=progress

Here is the documentation for it.

  ‘oflag=FLAG[,FLAG]...’
    ‘direct’
         Use direct I/O for data, avoiding the buffer cache.  Note that
         the kernel may impose restrictions on read or write buffer
         sizes.  For example, with an ext4 destination file system and a
         Linux-based kernel, using ‘oflag=direct’ will cause writes to
         fail with ‘EINVAL’ if the output buffer size is not a multiple
         of 512.

Bob
bug#51345:
> You could try running the following immediately after,
> to see if it also returns quickly:
>
>   blockdev --flushbufs /dev/sdb

The issue does not always reproduce, and the related USB thumb drive has
already been prepared for, and does store, important data, so that is not an
easy task. The USB thumb drive is also a pretty old device (roughly 10 years
or even older) with only 1 GB of storage space. Since dd with conv=fsync
returned after half of its usual writing time, I guess it is unlikely that
the controller of the USB thumb drive has its own dedicated 512 MiB buffer
attached to it.

> Well we're relying on the kernel here to not return from fsync()
> until appropriate.

But the question is whether there is a minor, unobvious bug somewhere in the
controlling logic of dd that might still cause such behavior. I also checked
the manpages for the sync() and fsync() calls, and they are actually quite
interesting. fsync() describes itself as flushing the caches so that data
can be retrieved even after crashes/reboots. But the interesting part is
that it describes itself as then blocking until the device reports
completion. What happens if the device reports completion even though the
kernel still sees cached writes in its memory-mapped area (since storage
devices are like their own small computers and could lie or have faulty
firmware)? If fsync() returns early here, it would not be against the
documentation in the manpage.

sync() is simpler here, as it describes itself as writing all pending
modifications to file (meta)data to the underlying filesystems. If it
returned after the device reports completion but while the kernel still
sees cached writes in its own context, that would strictly be a bug. The
interesting part is that the notes section of sync() sets sync(), syncfs()
and fsync() equal in guarantees.

With this information I see 3 possibilities here:

1. This is a bug in the controlling logic of dd that might not be obvious
   at all.
2. This is a bug in fsync() or somewhere lower down in the Linux kernel.
3. Returning early is the intended behavior of fsync() and does not
   strictly conflict with the manpage.

If the last is the case, it might be worth proposing a change to the Linux
kernel to additionally ensure that all cached writes are sent out from the
kernel's context before fsync() can return. It would also mean that
currently users can't rely on fsync() (e.g. via dd with conv=fsync) to
ensure the data has been flushed - instead they would need to take
additional action, like executing a sync in the terminal afterwards.
bug#51345: dd with conv=fsync sometimes returns when its writes are still cached
On 23/10/2021 09:18, Sworddragon wrote:
> On Knoppix 9.1 with the Linux Kernel 5.10.10-64 x86_64 and GNU Coreutils
> 8.32 I wanted to overwrite my USB Thumb Drive a few times with random
> data via "dd if=/dev/random of=/dev/sdb bs=1M conv=fsync". While it
> usually takes ~2+ minutes to perform this action dd returned once after
> less than 60 seconds which made me a bit curious. On a later attempt dd
> returned with this command after ~1 minute again but the LED on my USB
> Thumb Drive was still blinking and an immediately executed sync on the
> terminal blocked for ~1 minute and once it returned the LED on my USB
> Thumb Drive also stopped blinking.
>
> Knoppix 9.1 was booted as a live system from a DVD and the only other
> persistent storage attached to it was an internal HDD which was already
> overwritten the same way via a past booting session. Unfortunately Linux
> is not my main operating system anymore so I randomly encountered this
> issue on a nearly 1 year old distribution which might be slightly
> outdated. But the issue is pretty severe as it can lead to significant
> data loss so it is worth at least reporting it.

Well, we're relying on the kernel here to not return from fsync() until
appropriate. You could try running the following immediately after, to see
if it also returns quickly:

  blockdev --flushbufs /dev/sdb

Note a subsequent `sync /dev/sdb` would probably not be effective, since
that would only consider data written by the sync process (which would be
nothing).

cheers,
Pádraig
bug#51345: dd with conv=fsync sometimes returns when its writes are still cached
On Knoppix 9.1 with the Linux Kernel 5.10.10-64 x86_64 and GNU Coreutils
8.32 I wanted to overwrite my USB Thumb Drive a few times with random data
via "dd if=/dev/random of=/dev/sdb bs=1M conv=fsync". While it usually
takes ~2+ minutes to perform this action, dd returned once after less than
60 seconds, which made me a bit curious. On a later attempt dd returned
with this command after ~1 minute again, but the LED on my USB Thumb Drive
was still blinking, and an immediately executed sync on the terminal
blocked for ~1 minute; once it returned, the LED on my USB Thumb Drive also
stopped blinking.

Knoppix 9.1 was booted as a live system from a DVD, and the only other
persistent storage attached to it was an internal HDD which had already
been overwritten the same way in a past booting session. Unfortunately
Linux is not my main operating system anymore, so I randomly encountered
this issue on a nearly 1 year old distribution which might be slightly
outdated. But the issue is pretty severe, as it can lead to significant
data loss, so it is worth at least reporting it (and the "bad" behavior of
always calling sync after dd nonetheless will probably now continue to stay
for quite a while).