On Sat, 1 Jun 2024 18:37:23 +0200 Tim Rühsen <tim.rueh...@gmx.de> wrote:

> Hey David,
>
> > It appears that wget2 is getting files outside of what my regexes
> > allow, but on closer inspection, the files don't exist on my FS.
>
> Indeed, wget2 is acting slightly differently from wget.
> In this case, wget2 fetches URLs from pages outside your regex, but
> will only store those matching your regex. The idea is to fetch more of
> the stuff that is interesting to you. I can see why this can be
> debatable. What is your opinion on this, apart from "I want to keep the
> old behavior"?
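For concreteness, the kind of filtering I rely on can be sketched with grep standing in for wget2's internal matching (the URLs and the pattern below are made up for illustration, not my actual reject-regex):

```shell
# Hypothetical example: which discovered URLs a reject-regex like
# '/(login|search)/' would throw away. grep -Ev here plays the role
# of wget2's URL filter: lines matching the pattern are rejected.
printf '%s\n' \
  'https://example.com/docs/index.html' \
  'https://example.com/login/form.html' \
  'https://example.com/search?q=x' \
  | grep -Ev '/(login|search)[/?]'
# Only the /docs/ URL survives the filter.
```

With wget's old behavior the two rejected URLs are never fetched at all; with wget2's new behavior, as I understand it, they may still be fetched and parsed for links, just not stored.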
I understand why wget2's behavior is different, and there are use cases for it. I'm open to the new behavior, provided that there is some way to get the original behavior. The arguments for wget1's behavior are (in no particular order):

1: You don't want stuff that would be found by wget/wget2 if it were to search the files matched by the reject-regex.

2: The webmaster created a robots.txt that's a bit too strict, so you turn off robots but create a regex that's very similar. Thus, you don't want to get the stuff that's in your reject-regex.

3: The webmaster failed to create a robots.txt file at all, but the website would clearly benefit from one. In this case, as you may notice from my reject-regex, I already have it set up to work in place of the more common rules found in robots.txt files out there.

4: You're wasting both your bandwidth and the server's on material that you don't want.

Considering how many websites complain about robot user-agents, or have notes about various annoying ones in their robots.txt files, I think the last argument is a very important one.

> > Normally, you'd get HTTP response 200, or 404, or something, but
> > wget2 says that it's 0. What does that mean?
>
> Hm, I thought we fixed this issue already. Did you try with the latest
> version from trunk/master?
>
> "[x] Checking <URL> ..." means that a HEAD request is made to the URL
> to determine whether that page content may contain more URLs. E.g.
> HTML, CSS and RSS pages are downloaded and parsed for yet unknown URLs.

I just tried with the latest version from trunk/master. The problem is now gone.

> > What does "Adding URL: $URL" mean?
>
> It means that a URL has been found, and it is now checked whether it will
> be enqueued into the list of to-be-downloaded URLs. These checks are
> e.g. whether the URL is parsable/valid, has a known scheme (HTTP or HTTPS),
> isn't already known, matches filters, etc.
> One of the next lines will
> tell you whether the URL was actually enqueued or whether it has been
> sorted out (the reason is given as well).

I'd prefer it if wget2 weren't so chatty. Knowing what is downloaded is useful, and what's enqueued is okay-ish, but when wget2 starts listing off every URL it has found, I find it too verbose. I tried -nv, but it's really quiet: though it does give a reason, it doesn't list HTTP response codes for failed requests. E.g.:

Failed to open web.archive.org/web/index.html
web.archive.org/web/index.html not found (2)

While I'm at it, what does "(2)" mean?

> If you still run into an issue with the latest wget2, it would be good
> if you give information on how to reproduce. Ideally, a command line
> that everybody here can execute. If you have concerns putting that into
> the public, you may email one of the maintainers directly (but don't
> expect a fast response, we are just volunteers).
>
> Regards, Tim
>
<snip>

I shall.

Thanks,
David