Follow-up Comment #2, bug #62782 (group wget):

The fact that wget can't properly resume website mirrors has bitten me as well, and I wonder what the (technical) reasons for that are.
From the documentation of the `-nc, --no-clobber` option:

> Note that when -nc is specified, files with the suffixes .html or .htm will be loaded from the local disk and parsed as if they had been retrieved from the Web.

Re-crawling local files sounds like the perfect solution for resuming a partial archive. I tried to get that to work, but it just doesn't seem to be supported for my use case. What I tried is the following (the full invocation is sketched at the end of this message):

- I don't use `--mirror`, because that implies `--timestamping`, which needs to stay disabled so that wget doesn't issue a HEAD request for every file. I know that my partial archive is up to date and don't need wget to re-check. Instead, I use `--recursive` and `--level inf`.
- I use `--convert-links`, and I'm not sure why that would be incompatible with `-nc`. It doesn't seem to conflict with simply re-crawling the files from disk.
- I also use `--adjust-extension` (previously `--html-extension`) and `--restrict-file-names=windows`, since the thing I'm crawling is a PHP forum with links like `showthread.php?pid=5`, which then get converted to nice filenames like `showthread.php@pid=5.html`.

I understand that another challenge wget faces is that URLs may redirect, and those redirects need to be followed; the filename on disk will be whatever the redirects ultimately led to. However, suppose some URL does redirect. That would imply that no file for the original URL is on disk, correct? Or, approaching that logic backwards: if wget checks for the presence of a file (by following the renaming rules of `--adjust-extension` and `--restrict-file-names`) and it _is_ found on disk, would that not imply that the URL was already crawled and did not redirect? Of course, that logic only works if we assume the redirects never change, but with timestamping turned off we're already in "potentially stale" territory anyway and fine with that. (A rough sketch of the check I have in mind also follows at the end of this message.)

Considering all this, should it not be possible to resume downloading partial archives, even without any link cache files? Does anybody know if there are more good reasons why it doesn't work this way? Or whether it could be done but just requires someone to implement it?
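For reference, this is roughly the invocation I end up with. The forum URL is just a placeholder for the actual site, and `-nc` together with `--convert-links` is exactly the combination wget currently objects to:

```bash
# Roughly the invocation described above; https://forum.example.com/ is a
# placeholder for the real PHP forum. Combining --no-clobber with
# --convert-links is the part wget currently considers incompatible,
# which is the point of this report.
wget --recursive --level inf \
     --no-clobber \
     --convert-links \
     --adjust-extension \
     --restrict-file-names=windows \
     https://forum.example.com/
```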
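And this is a rough sketch of the "is this URL already on disk?" check I'm imagining. It is not something wget does today, just an illustration, assuming host directories are kept (no `-nH`), `--adjust-extension` and `--restrict-file-names=windows` are in effect, and the URL did not redirect:

```bash
#!/usr/bin/env bash
# Illustration only: map a URL to the local filename wget would produce
# under --adjust-extension and --restrict-file-names=windows, then decide
# whether it still needs to be fetched. Not an actual wget feature.

url='https://forum.example.com/showthread.php?pid=5'   # placeholder URL

# wget mirrors into <host>/<path>, so start from the URL without the scheme.
local_path="${url#*://}"

# --restrict-file-names=windows escapes '?' as '@' (among other characters).
local_path="${local_path//\?/@}"

# --adjust-extension appends .html to HTML content whose name doesn't
# already end in .html or .htm.
case "$local_path" in
  *.html|*.htm) ;;
  *) local_path="${local_path}.html" ;;
esac

if [ -f "$local_path" ]; then
  echo "already crawled (and presumably did not redirect): $local_path"
else
  echo "not on disk yet, needs to be fetched: $url"
fi
```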
_______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?62782>