I got an email that somebody replied to this report, but it looks like they sent it to the wrong mail address (maybe the messages will be merged in order); I will still reply to it:
> Suggestion as a possible workaround: Have a look at random(4) and random(7),
> and ask yourself whether your use of /dev/random, rather than /dev/urandom,
> is really necessary for your application. If not, you might try /dev/urandom
> instead and report what you observe.
>
> As documented in those man pages, there are good reasons to avoid using
> /dev/random, not the least of which is that it blocks frequently (every time
> the entropy pool is exhausted), whereas /dev/urandom does not. That alone may
> explain the execution time inconsistencies you're reporting.

I'm aware of the differences between the two, and I don't see how the use of /dev/random could explain any of the issues here:

- The drastically increased writing times occur after dd has already reported that no free space is left (which presumably means dd has finished its explicit writing task, i.e. writing into the transparent underlying cache), and dd's times up to that point were fine. This implies that either significant blocking never occurred, or dd correctly handles blocking input files by asynchronously continuing to write to the output file to avoid unnecessary delays.

- That dd returns while the buffers are not yet flushed also can't be explained by the use of /dev/random, unless there is some very crazy bug somewhere.

But I still ran some tests with /dev/urandom and conv=fsync, and I saw all 3 cases there too: normal writing times that are slightly longer (133.247s); drastically increased writing times (235.906s); and dd returning early (56.4327s) while an immediately executed sync still blocked for over a minute.
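For reference, the conv=fsync runs followed roughly this pattern. This is a sketch only: /tmp/dd_fsync_test and the reduced size stand in for the actual USB device node and the full-device write.

```shell
# Sketch of the measurement; replace of= with the real device node
# (and drop count=) to reproduce the original full-device test.
dd if=/dev/urandom of=/tmp/dd_fsync_test bs=1M count=16 conv=fsync status=progress

# If conv=fsync really flushed everything before dd returned, this
# should come back almost immediately; in the problematic runs it
# blocked for over a minute.
time sync

rm -f /tmp/dd_fsync_test
```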
Also, regarding the slightly delayed writing times (~133s with conv=fsync compared to ~129s with oflag=direct): I noticed that with conv=fsync the LED of the USB thumb drive starts to blink a few seconds after dd starts showing status progress, so I assume Knoppix's kernel/settings cause the cache to be flushed with a slight delay. That seems more or less normal and would explain this specific case of being ~4s slower.

And while I was writing this, the next message came in:

> Ah right. What's happening is dd is not doing the fsync()
> as it's exiting early due to write(2) getting ENOSPC.

But that doesn't make much sense, for 2 reasons:

1. The error message that the space is exhausted is the expected behavior here, and there is no rationale for dd to exit at that point instead of still calling fsync().

2. In most attempts, after dd had printed the error message about no free space being left, it still blocked for at least a minute, and an immediately executed sync returned right away.

So it looks like dd sometimes does call fsync() on ENOSPC and sometimes it doesn't. And why a successful call to fsync() is then sometimes ~2.5× slower is also a mystery.

> So this is a gotcha that should at least be documented.
> Though I'm leaning towards improving this by always
> doing an fsync on exit if we get a read or write error
> and have successfully written any data, so that
> previously written data is sync'd as requested.

Yes, ensuring that fsync() is called seems to be the better option here. But this still leaves the question from above: why does dd seemingly already do this on ENOSPC sometimes? Maybe a race between ENOSPC and fsync() in dd is causing this bug? Perhaps the occasionally occurring very delayed writes (e.g. ~235s instead of ~133s) play a role here?
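The behavior proposed above can be sketched as a plain copy loop that saves the write(2) error but still fsync()s whatever was written before returning. This is a minimal illustration of the idea, not the actual coreutils code; copy_with_fsync is a hypothetical helper name.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Sketch: copy in_fd to out_fd; even if write(2) fails (e.g. with
 * ENOSPC), call fsync() before returning so that data already written,
 * and possibly still sitting in the page cache, is flushed. */
int copy_with_fsync(int in_fd, int out_fd)
{
    char buf[64 * 1024];
    off_t written = 0;
    int saved_errno = 0;

    for (;;) {
        ssize_t n = read(in_fd, buf, sizeof buf);
        if (n == 0)
            break;                      /* end of input */
        if (n < 0) {
            saved_errno = errno;        /* read error */
            break;
        }
        ssize_t off = 0;
        while (off < n) {
            ssize_t w = write(out_fd, buf + off, n - off);
            if (w < 0) {
                saved_errno = errno;    /* e.g. ENOSPC */
                goto out;
            }
            off += w;
            written += w;
        }
    }
out:
    /* The key point: fsync even after an error, as long as any data
     * was written, so conv=fsync semantics are still honored. */
    if (written > 0 && fsync(out_fd) != 0 && saved_errno == 0)
        saved_errno = errno;
    if (saved_errno != 0) {
        errno = saved_errno;
        return -1;
    }
    return 0;
}
```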