Hi,

Thanks for confirming it's a bug. I'm currently not fluent enough in C to provide a fix myself, but I see a patch was already posted, so I hope that's satisfactory.

Cheers,
Friso

On Wed, Jan 17, 2018 at 3:01 PM, Darshit Shah <[email protected]> wrote:

> Hi,
>
> This is a bug in Wget, apparently a really old one! Seems like the bug has
> been around since at least 1997.
>
> Looking at the source, the issue is that Wget does a very simple suffix
> matching on the actual domain and accepted domains list. This is obviously
> wrong as you have just found out.
>
> I'm going to try and implement this correctly, but I'm currently a little
> short on time, so if anyone else wants to pick it up, please feel free to.
> It's simple, use libpsl to get the proper domain name and match against that.
>
> Of course, this change will require libpsl to no longer be an optional
> dependency
>
> * Friso van Vollenhoven <[email protected]> [180117 14:40]:
> > Hello all,
> >
> > I am trying to do a recursive download of a webpage and span multiple hosts
> > within the same domain, but not cross to other domains. The issue is that
> > the crawl does extend to other domains. My full command is this:
> >
> > wget \
> >   --recursive \
> >   --no-clobber \
> >   --page-requisites \
> >   --adjust-extension \
> >   --span-hosts \
> >   --domains=scapino.nl \
> >   --no-parent \
> >   --tries=2 \
> >   --wait=1 \
> >   --random-wait \
> >   --waitretry=2 \
> >   --header='User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36' \
> >   https://www.scapino.nl/winkels/scapino-utrecht-510061
> >
> > From this combination of --span-hosts and --domains, I would expect to
> > download assets from cdn.scapino.nl and www.scapino.nl, but not other
> > domains. For some reason that I don't understand, wget also starts to do
> > what looks like a full crawl of the domain werkenbijscapino.nl, which is
> > referenced from the original page.
> >
> > Any thoughts or direction would be much appreciated.
> >
> > I am using wget 1.18 on Debian.
> >
> > Best regards,
> > Friso
>
> --
> Thanking You,
> Darshit Shah
> PGP Fingerprint: 7845 120B 07CB D8D6 ECE5 FF2B 2A17 43ED A91A 35B6
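
For anyone reading this thread later, here is a rough Python sketch of why werkenbijscapino.nl slips past --domains=scapino.nl. It is not Wget's code (Wget is written in C), and the function names are made up for illustration. The first function mimics the plain suffix comparison described above. The second only accepts the domain itself or a host that ends in "." plus the domain. That second check is a simpler stand-in for the libpsl-based fix suggested above, not an implementation of it, and it knows nothing about public suffixes.

```python
def naive_suffix_match(host: str, accepted: str) -> bool:
    """Plain string-suffix test, i.e. the behaviour described in the thread.

    Hypothetical helper, not taken from the Wget source.
    """
    return host.lower().endswith(accepted.lower())


def label_boundary_match(host: str, accepted: str) -> bool:
    """Accept only the domain itself or a true subdomain of it.

    Hypothetical helper: the match must start at a dot, so a longer
    registered name that merely ends in the same characters is rejected.
    """
    host = host.lower().rstrip(".")
    accepted = accepted.lower().strip(".")
    return host == accepted or host.endswith("." + accepted)


if __name__ == "__main__":
    accepted = "scapino.nl"
    for host in ("www.scapino.nl", "cdn.scapino.nl", "werkenbijscapino.nl"):
        print(host,
              naive_suffix_match(host, accepted),
              label_boundary_match(host, accepted))
```

The dot-boundary check is enough for the scapino.nl case, but it would still accept something like --domains=co.uk as a single "domain", which is presumably why a public-suffix-aware library was proposed.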
