On Sat, 1 Jun 2024 18:37:23 +0200 Tim Rühsen <tim.rueh...@gmx.de> wrote:

> Hey David,
>
> > It appears that wget2 is getting files outside of what my regexes
> > allow, but on closer inspection, the files don't exist on my FS.
>
> Indeed, wget2 is acting slightly differently from wget.
> In this case, wget2 fetches URLs from pages outside your regex, but
> will only store those matching your regex. The idea is to fetch more of
> the stuff that is interesting to you. I can see why this can be
> debatable. What is your opinion on this, apart from "I want to keep the
> old behavior"?
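For concreteness, the kind of filtering I rely on can be sketched with grep standing in for wget2's internal matching (the URLs and the pattern below are made up for illustration, not my actual reject-regex):

```shell
# Hypothetical example: which discovered URLs a reject-regex like
# '/(login|search)/' would throw away. grep -Ev here plays the role
# of wget2's URL filter: lines matching the pattern are rejected.
printf '%s\n' \
  'https://example.com/docs/index.html' \
  'https://example.com/login/form.html' \
  'https://example.com/search?q=x' \
  | grep -Ev '/(login|search)[/?]'
# Only the /docs/ URL survives the filter.
```

With wget's old behavior the two rejected URLs are never fetched at all; with wget2's new behavior, as I understand it, they may still be fetched and parsed for links, just not stored.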
I understand why wget2's behavior is different, and there are use cases for it. I'm open to the new behavior, provided that there is some way to get the original behavior. The arguments for wget1's behavior are (in no particular order):

1: You don't want stuff that would be found by wget/wget2 if it were to search the files matched by the reject-regex.

2: The webmaster created a robots.txt that's a bit too strict, so you turn off robots but create a regex that's very similar. Thus, you don't want to get the stuff that's in your reject-regex.

3: The webmaster failed to create a robots.txt file at all, but the website would clearly benefit from one. In this case, as you may notice from my reject-regex, I already have it set up to work in place of the more common rules found in robots.txt files out there.

4: You're wasting both your bandwidth and the server's on material that you don't want.

Considering how many websites complain about robot user-agents, or have notes about various annoying ones in their robots.txt files, I think the last argument is a very important one.

> > Normally, you'd get HTTP response 200, or 404, or something, but
> > wget2 says that it's 0. What does that mean?
>
> Hm, I thought we fixed this issue already. Did you try with the latest
> version from trunk/master?
>
> "[x] Checking <URL> ..." means that a HEAD request is made to the URL
> to determine whether that page content may contain more URLs. E.g.
> HTML, CSS and RSS pages are downloaded and parsed for yet unknown URLs.

I just tried with the latest version from trunk/master. The problem is now gone.

> > What does "Adding URL: $URL" mean?
>
> It means that a URL has been found, and it is now checked whether it will
> be enqueued into the list of to-be-downloaded URLs. These checks are
> e.g. whether the URL is parsable/valid, has a known scheme (HTTP or HTTPS),
> isn't already known, matches filters, etc.
> One of the next lines will
> tell you whether the URL was actually enqueued or whether it has been
> sorted out (the reason is given as well).

I'd prefer it if wget2 weren't so chatty. Knowing what is downloaded is useful, and what's enqueued is okay-ish, but when wget2 starts listing off every URL it has found, I find it too verbose. I tried -nv, but it's really quiet: though it does give a reason, it doesn't list HTTP response codes for failed requests. E.g.:

Failed to open web.archive.org/web/index.html
web.archive.org/web/index.html not found (2)

While I'm at it, what does "(2)" mean?

> If you still run into an issue with the latest wget2, it would be good
> if you give information on how to reproduce. Ideally, a command line
> that everybody here can execute. If you have concerns putting that into
> the public, you may email one of the maintainers directly (but don't
> expect a fast response, we are just volunteers).
>
> Regards, Tim
>
<snip>

I shall.

Thanks,
David