Follow-up Comment #2, bug #62782 (group wget):

The fact that wget can't properly resume website mirrors has bitten me as well, and I wonder what the (technical) reasons for that are.
From the documentation of the `-nc, --no-clobber` option:

> Note that when -nc is specified, files with the suffixes .html or .htm will be loaded from the local disk and parsed as if they had been retrieved from the Web.

Re-crawling local files sounds like the perfect solution for resuming a partial archive. I tried to get that to work, but it just doesn't seem to be supported for my use case. What I tried is the following (the full invocation is sketched at the end of this message):

- I don't use `--mirror`, because that implies `--timestamping`, which needs to stay disabled so that wget doesn't issue a HEAD request for every file. I know that my partial archive is up to date and don't need wget to re-check. Instead, I use `--recursive` and `--level inf`.
- I use `--convert-links`, and I'm not sure why that would be incompatible with `-nc`. It doesn't seem to conflict with simply re-crawling the files from disk.
- I also use `--adjust-extension` (previously `--html-extension`) and `--restrict-file-names=windows`, since the thing I'm crawling is a PHP forum with links like `showthread.php?pid=5`, which then get converted to nice filenames like `showthread.php@pid=5.html`.

I understand that another challenge wget faces is that URLs may redirect, and those redirects need to be followed; the filename on disk will be whatever the redirects ultimately led to. However, suppose some URL does redirect. That would imply that no file for the original URL is on disk, correct? Or, approaching that logic backwards: if wget checks for the presence of a file (by following the renaming rules of `--adjust-extension` and `--restrict-file-names`) and it _is_ found on disk, would that not imply that the URL was already crawled and did not redirect? Of course, that logic only works if we assume the redirects never change, but with timestamping turned off we're already in "potentially stale" territory anyway and fine with that. (A rough sketch of the check I have in mind also follows at the end of this message.)

Considering all this, should it not be possible to resume downloading partial archives, even without any link cache files? Does anybody know if there are more good reasons why it doesn't work this way? Or whether it could be done but just requires someone to implement it?
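For reference, this is roughly the invocation I end up with. The forum URL is just a placeholder for the actual site, and `-nc` together with `--convert-links` is exactly the combination wget currently objects to:

```bash
# Roughly the invocation described above; https://forum.example.com/ is a
# placeholder for the real PHP forum. Combining --no-clobber with
# --convert-links is the part wget currently considers incompatible,
# which is the point of this report.
wget --recursive --level inf \
     --no-clobber \
     --convert-links \
     --adjust-extension \
     --restrict-file-names=windows \
     https://forum.example.com/
```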
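And this is a rough sketch of the "is this URL already on disk?" check I'm imagining. It is not something wget does today, just an illustration, assuming host directories are kept (no `-nH`), `--adjust-extension` and `--restrict-file-names=windows` are in effect, and the URL did not redirect:

```bash
#!/usr/bin/env bash
# Illustration only: map a URL to the local filename wget would produce
# under --adjust-extension and --restrict-file-names=windows, then decide
# whether it still needs to be fetched. Not an actual wget feature.

url='https://forum.example.com/showthread.php?pid=5'   # placeholder URL

# wget mirrors into <host>/<path>, so start from the URL without the scheme.
local_path="${url#*://}"

# --restrict-file-names=windows escapes '?' as '@' (among other characters).
local_path="${local_path//\?/@}"

# --adjust-extension appends .html to HTML content whose name doesn't
# already end in .html or .htm.
case "$local_path" in
  *.html|*.htm) ;;
  *) local_path="${local_path}.html" ;;
esac

if [ -f "$local_path" ]; then
  echo "already crawled (and presumably did not redirect): $local_path"
else
  echo "not on disk yet, needs to be fetched: $url"
fi
```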
_______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?62782>