On Mon, Sep 9, 2013 at 4:49 AM, Tomas Hozza <[email protected]> wrote:
> ----- Original Message -----
>> Very well, if this would be possible. Right now I have no idea how to
>> print something like the above. I ran Tomas Hozza's test under valgrind
>> with a wget built with debug info. I got SIGBUS 18 times out of 20, but
>> at completely different places in the code. In this misuse scenario,
>> SIGBUS can occur at any place where memory allocated by
>> wget_read_file() is accessed (read or write) - completely
>> unpredictably, whenever an outside process changes the file's size
>> and/or content at the same time.
>>
>> And SIGBUS could also occur for other reasons (e.g. real bugs in Wget).
>>
>> As was already said, replacing mmap with read would not crash
>> (wget_read_file() reads as many bytes as there are, without first
>> checking the file's length). But without additional logic it might read
>> garbage (many processes writing to the file at the same time, not
>> necessarily the same data). Wget would try to parse / change (-k) it,
>> the result would be broken, but no error would be printed. So replacing
>> mmap is not a solution by itself, but maybe part of one.
>>
>> Now to the possible solutions that come to my mind:
>> 1. While downloading / writing data, Wget could build a checksum of the
>> file. That allows a check later, when re-reading the file. In this case
>> we could really tell the user: hey, someone trashed our file while we
>> were working... To get this working, we must remove the mmap code.
>>
>> 2. Using tempfiles / tempdirs only and moving them to the right place.
>> That would bring in some kind of atomicity, though there are still
>> conflicts to solve (e.g. a second Wget instance is faster - should we
>> overwrite existing files / directories?).
>>
>> 3. Keeping html/css files in memory after downloading. These are the
>> ones we later re-read to parse them for links/URLs. Write them to disk
>> after parsing, using a tempfile and a move/rename for atomicity.
>>
>> 4. Using (advisory) file locks just helps against other Wget instances
>> (is that enough?). And with -k you would have to keep the descriptor
>> open for each file until Wget is done downloading everything. That is
>> not practical, since there could be (10-, 100-)thousands of files to
>> download.
>>
>> If someone would like to work on a patch, here is my opinion: I would
>> implement 1. as the least complex to code (though it needs more CPU).
>> Point 4 would not work in all cases.
>>
>> Regards, Tim
>
> Thanks for the brainstorming. Solution #1 seems the most reasonable to
> me. I was thinking about 2. and 4., but they have the possible issues
> you've already mentioned.
>
> I had a look at the source, but unfortunately the changes needed to
> create and verify checksums of downloaded files are not trivial.
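A minimal sketch of what option 1 could look like, assuming hypothetical
names throughout - FNV-1a is only a placeholder for whatever real digest
would be used (MD5, SHA-1, ...), and none of these functions exist in
Wget today:

    /* Hypothetical sketch of option 1: keep a running checksum while
       the payload is written, remember it, and verify it before the
       file is re-read for parsing / -k conversion. */
    #include <stdint.h>
    #include <stdio.h>

    #define FNV1A_INIT 0xcbf29ce484222325ULL

    static uint64_t
    fnv1a_update (uint64_t h, const void *buf, size_t len)
    {
      const unsigned char *p = buf;
      while (len--)
        h = (h ^ *p++) * 0x100000001b3ULL;
      return h;
    }

    /* Call fnv1a_update() on every chunk in the download loop and
       store the final value next to the file's name.  Before parsing,
       re-read the file with plain read calls and compare: */
    static int
    file_unchanged (const char *name, uint64_t expected)
    {
      FILE *fp = fopen (name, "rb");
      unsigned char buf[4096];
      size_t n;
      uint64_t h = FNV1A_INIT;

      if (!fp)
        return 0;
      while ((n = fread (buf, 1, sizeof buf, fp)) > 0)
        h = fnv1a_update (h, buf, n);
      fclose (fp);
      return h == expected;   /* 0 => someone changed the file */
    }

Since verification goes through plain reads, this also drops the mmap
path (and with it the SIGBUS), at the price of hashing every byte twice.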
the metalink stuff should be checking hashes, so if that was the way to
go (I wouldn't know :), maybe some of that could be re-used.

http://git.savannah.gnu.org/cgit/wget.git?h=parallel-wget

--
(( Anthony Bryan ... Metalink [ http://www.metalinker.org ] ))
  Easier, More Reliable, Self Healing Downloads
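For completeness, the tempfile/rename part of options 2 and 3 could be
sketched roughly like this (again illustrative names only, not actual
Wget code); on POSIX, rename(2) replaces the target atomically, so
readers never see a half-written file:

    /* Hypothetical sketch of the atomic write in options 2/3: write
       into a mkstemp() file in the target directory, then rename() it
       over the final name. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    static int
    write_file_atomically (const char *name, const void *data, size_t len)
    {
      char tmp[4096];
      int fd, ok;
      FILE *fp;

      if (snprintf (tmp, sizeof tmp, "%s.XXXXXX", name) >= (int) sizeof tmp)
        return -1;
      fd = mkstemp (tmp);       /* unique tempfile in the same dir */
      if (fd < 0)
        return -1;
      fp = fdopen (fd, "wb");
      if (!fp)
        {
          close (fd);
          unlink (tmp);
          return -1;
        }
      ok = fwrite (data, 1, len, fp) == len;
      ok = fclose (fp) == 0 && ok;
      if (!ok || rename (tmp, name) != 0)   /* atomic replace */
        {
          unlink (tmp);
          return -1;
        }
      return 0;
    }

Note this doesn't decide the two-instances race by itself - the later
rename() silently wins, which is exactly the "should we overwrite
existing files" question raised under option 2.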
