On Mon, Sep 9, 2013 at 4:49 AM, Tomas Hozza <[email protected]> wrote:
> ----- Original Message -----
>> Very well, if this would be possible. Right now I have no idea how to
>> print something like the above. I ran Tomas Hozza's test under valgrind
>> with a wget built with debug info. I got SIGBUS 18 times out of 20, but
>> at completely different places in the code. In this misuse scenario,
>> SIGBUS can occur at any place where memory allocated by
>> wget_read_file() is accessed (read or write) - completely
>> unpredictably, whenever an outside process changes the file's size
>> and/or content at the same time.
>>
>> And SIGBUS could also occur for other reasons (e.g. real bugs in Wget).
>>
>> As was already said, replacing mmap with read would not crash
>> (wget_read_file() reads as many bytes as there are, without first
>> checking the file's length). But without additional logic it might read
>> garbage (many processes writing to the file at the same time, not
>> necessarily the same data). Wget would try to parse / change (-k) it,
>> the result would be broken, but no error would be printed. So replacing
>> mmap is not a solution by itself, but maybe part of one.
>>
>> Now to the possible solutions that come to my mind:
>> 1. While downloading / writing data, Wget could build a checksum of the
>> file. That allows a check later, when re-reading the file. In this case
>> we could really tell the user: hey, someone trashed our file while we
>> were working... To get this working, we must remove the mmap code.
>>
>> 2. Using tempfiles / tempdirs only and moving them to the right place.
>> That would bring in some kind of atomicity, though there are still
>> conflicts to solve (e.g. a second Wget instance is faster - should we
>> overwrite existing files / directories?).
>>
>> 3. Keeping html/css files in memory after downloading. These are the
>> ones we later re-read to parse them for links/URLs. Write them to disk
>> after parsing, using a tempfile and a move/rename for atomicity.
>>
>> 4. Using (advisory) file locks just helps against other Wget instances
>> (is that enough?). And with -k you would have to keep the descriptor
>> open for each file until Wget is done downloading everything. That is
>> not practical, since there could be (10-, 100-)thousands of files to
>> download.
>>
>> If someone would like to work on a patch, here is my opinion: I would
>> implement 1. as the least complex to code (though it needs more CPU).
>> Point 4 would not work in all cases.
>>
>> Regards, Tim
>
> Thanks for the brainstorming. Solution #1 seems the most reasonable to
> me. I was thinking about 2. and 4., but they have the possible issues
> you've already mentioned.
>
> I had a look at the source, but unfortunately the changes needed to
> create and verify checksums of downloaded files are not trivial.
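A minimal sketch of what option 1 could look like, assuming hypothetical
names throughout - FNV-1a is only a placeholder for whatever real digest
would be used (MD5, SHA-1, ...), and none of these functions exist in
Wget today:

    /* Hypothetical sketch of option 1: keep a running checksum while
       the payload is written, remember it, and verify it before the
       file is re-read for parsing / -k conversion. */
    #include <stdint.h>
    #include <stdio.h>

    #define FNV1A_INIT 0xcbf29ce484222325ULL

    static uint64_t
    fnv1a_update (uint64_t h, const void *buf, size_t len)
    {
      const unsigned char *p = buf;
      while (len--)
        h = (h ^ *p++) * 0x100000001b3ULL;
      return h;
    }

    /* Call fnv1a_update() on every chunk in the download loop and
       store the final value next to the file's name.  Before parsing,
       re-read the file with plain read calls and compare: */
    static int
    file_unchanged (const char *name, uint64_t expected)
    {
      FILE *fp = fopen (name, "rb");
      unsigned char buf[4096];
      size_t n;
      uint64_t h = FNV1A_INIT;

      if (!fp)
        return 0;
      while ((n = fread (buf, 1, sizeof buf, fp)) > 0)
        h = fnv1a_update (h, buf, n);
      fclose (fp);
      return h == expected;   /* 0 => someone changed the file */
    }

Since verification goes through plain reads, this also drops the mmap
path (and with it the SIGBUS), at the price of hashing every byte twice.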
the metalink stuff should be checking hashes, so if that was the way to
go (I wouldn't know :), maybe some of that could be re-used.

http://git.savannah.gnu.org/cgit/wget.git?h=parallel-wget

--
(( Anthony Bryan ... Metalink [ http://www.metalinker.org ] ))
  Easier, More Reliable, Self Healing Downloads
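For completeness, the tempfile/rename part of options 2 and 3 could be
sketched roughly like this (again illustrative names only, not actual
Wget code); on POSIX, rename(2) replaces the target atomically, so
readers never see a half-written file:

    /* Hypothetical sketch of the atomic write in options 2/3: write
       into a mkstemp() file in the target directory, then rename() it
       over the final name. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    static int
    write_file_atomically (const char *name, const void *data, size_t len)
    {
      char tmp[4096];
      int fd, ok;
      FILE *fp;

      if (snprintf (tmp, sizeof tmp, "%s.XXXXXX", name) >= (int) sizeof tmp)
        return -1;
      fd = mkstemp (tmp);       /* unique tempfile in the same dir */
      if (fd < 0)
        return -1;
      fp = fdopen (fd, "wb");
      if (!fp)
        {
          close (fd);
          unlink (tmp);
          return -1;
        }
      ok = fwrite (data, 1, len, fp) == len;
      ok = fclose (fp) == 0 && ok;
      if (!ok || rename (tmp, name) != 0)   /* atomic replace */
        {
          unlink (tmp);
          return -1;
        }
      return 0;
    }

Note this doesn't decide the two-instances race by itself - the later
rename() silently wins, which is exactly the "should we overwrite
existing files" question raised under option 2.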
