I got an email that somebody replied to this report, but it looks like they sent it to the wrong mail address (maybe the messages will be merged in order); I will still reply to it:
> Suggestion as a possible workaround: Have a look at random(4) and random(7),
> and ask yourself whether your use of /dev/random, rather than /dev/urandom,
> is really necessary for your application. If not, you might try /dev/urandom
> instead and report what you observe.
>
> As documented in those man pages, there are good reasons to avoid using
> /dev/random, not the least of which is that it blocks frequently (every time
> the entropy pool is exhausted), whereas /dev/urandom does not. That alone may
> explain the execution time inconsistencies you're reporting.

I'm aware of the differences between the two, and I don't see how the use of /dev/random could explain any of the issues here:

- The drastically increased writing times occur after dd has already reported that no free space is left (which presumably means dd has finished its explicit writing task, i.e. writing into the transparent underlying cache), and dd's times up to that point were fine. This implies that either significant blocking never occurred, or dd correctly handles blocking input files by asynchronously continuing to write to the output file to avoid unnecessary delays.

- That dd returns while the buffers are not yet flushed also can't be explained by the use of /dev/random, unless there is some very crazy bug somewhere.

But I still ran some tests with /dev/urandom and conv=fsync, and I saw all 3 cases there too: normal writing times that are slightly longer (133.247s); drastically increased writing times (235.906s); and dd returning early (56.4327s) while an immediately executed sync still blocked for over a minute.
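For reference, the conv=fsync runs followed roughly this pattern. This is a sketch only: /tmp/dd_fsync_test and the reduced size stand in for the actual USB device node and the full-device write.

```shell
# Sketch of the measurement; replace of= with the real device node
# (and drop count=) to reproduce the original full-device test.
dd if=/dev/urandom of=/tmp/dd_fsync_test bs=1M count=16 conv=fsync status=progress

# If conv=fsync really flushed everything before dd returned, this
# should come back almost immediately; in the problematic runs it
# blocked for over a minute.
time sync

rm -f /tmp/dd_fsync_test
```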
Also, regarding the slightly delayed writing times (~133s with conv=fsync compared to ~129s with oflag=direct): I noticed that with conv=fsync the LED of the USB thumb drive starts to blink a few seconds after dd starts showing status progress, so I assume Knoppix's kernel/settings cause the cache to be flushed with a slight delay. That seems more or less normal and would explain this specific case of being ~4s slower.

And while I was writing this, the next message came in:

> Ah right. What's happening is dd is not doing the fsync()
> as it's exiting early due to write(2) getting ENOSPC.

But that doesn't make much sense, for 2 reasons:

1. The error message that the space is exhausted is the expected behavior here, and there is no rationale for dd to exit at that point instead of still calling fsync().

2. In most attempts, after dd had printed the error message about no free space being left, it still blocked for at least a minute, and an immediately executed sync returned right away.

So it looks like dd sometimes does call fsync() on ENOSPC and sometimes it doesn't. And why a successful call to fsync() is then sometimes ~2.5× slower is also a mystery.

> So this is a gotcha that should at least be documented.
> Though I'm leaning towards improving this by always
> doing an fsync on exit if we get a read or write error
> and have successfully written any data, so that
> previously written data is sync'd as requested.

Yes, ensuring that fsync() is called seems to be the better option here. But this still leaves the question from above: why does dd seemingly already do this on ENOSPC sometimes? Maybe a race between ENOSPC and fsync() in dd is causing this bug? Perhaps the occasionally occurring very delayed writes (e.g. ~235s instead of ~133s) play a role here?
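The behavior proposed above can be sketched as a plain copy loop that saves the write(2) error but still fsync()s whatever was written before returning. This is a minimal illustration of the idea, not the actual coreutils code; copy_with_fsync is a hypothetical helper name.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Sketch: copy in_fd to out_fd; even if write(2) fails (e.g. with
 * ENOSPC), call fsync() before returning so that data already written,
 * and possibly still sitting in the page cache, is flushed. */
int copy_with_fsync(int in_fd, int out_fd)
{
    char buf[64 * 1024];
    off_t written = 0;
    int saved_errno = 0;

    for (;;) {
        ssize_t n = read(in_fd, buf, sizeof buf);
        if (n == 0)
            break;                      /* end of input */
        if (n < 0) {
            saved_errno = errno;        /* read error */
            break;
        }
        ssize_t off = 0;
        while (off < n) {
            ssize_t w = write(out_fd, buf + off, n - off);
            if (w < 0) {
                saved_errno = errno;    /* e.g. ENOSPC */
                goto out;
            }
            off += w;
            written += w;
        }
    }
out:
    /* The key point: fsync even after an error, as long as any data
     * was written, so conv=fsync semantics are still honored. */
    if (written > 0 && fsync(out_fd) != 0 && saved_errno == 0)
        saved_errno = errno;
    if (saved_errno != 0) {
        errno = saved_errno;
        return -1;
    }
    return 0;
}
```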