Hi,

this is not PSL matching, so libpsl is not needed.

Only sufmatch() has to be fixed to do (sub)domain matching.

Attached is a fix.

With Best Regards, Tim

On 01/17/2018 03:01 PM, Darshit Shah wrote:
> Hi,
>
> This is a bug in Wget, apparently a really old one! Seems like the bug has
> been around since at least 1997.
>
> Looking at the source, the issue is that Wget does a very simple suffix
> matching on the actual domain and accepted domains list. This is obviously
> wrong as you have just found out.
>
> I'm going to try and implement this correctly, but I'm currently a little
> short on time, so if anyone else wants to pick it up, please feel free to.
> It's simple, use libpsl to get the proper domain name and match against that.
>
> Of course, this change will require libpsl to no longer be an optional
> dependency
>
> * Friso van Vollenhoven <[email protected]> [180117 14:40]:
>> Hello all,
>>
>> I am trying to do a recursive download of a webpage and span multiple hosts
>> within the same domain, but not cross to other domains. The issue is that
>> the crawl does extend to other domains. My full command is this:
>>
>> wget \
>>   --recursive \
>>   --no-clobber \
>>   --page-requisites \
>>   --adjust-extension \
>>   --span-hosts \
>>   --domains=scapino.nl \
>>   --no-parent \
>>   --tries=2 \
>>   --wait=1 \
>>   --random-wait \
>>   --waitretry=2 \
>>   --header='User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2)
>> AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36' \
>>   https://www.scapino.nl/winkels/scapino-utrecht-510061
>>
>> From this combination of --span-hosts and --domains, I would expect to
>> download assets from cdn.scapino.nl and www.scapino.nl, but not other
>> domains. For some reason that I don't understand, wget also starts to do
>> what looks like a full crawl of the domain werkenbijscapino.nl, which is
>> referenced from the original page.
>>
>> Any thoughts or direction would be much appreciated.
>>
>> I am using wget 1.18 on Debian.
>>
>>
>> Best regards,
>> Friso
>
From 1ad636baa63cfe029c84235986626c17e4ff33cb Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Tim=20R=C3=BChsen?= <[email protected]>
Date: Wed, 17 Jan 2018 15:50:48 +0100
Subject: [PATCH] * src/host.c (sufmatch): Fix to domain matching

---
 src/host.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/src/host.c b/src/host.c
index 2ddae328..d337cc7c 100644
--- a/src/host.c
+++ b/src/host.c
@@ -1017,18 +1017,25 @@ sufmatch (const char **list, const char *what)
   int i, j, k, lw;
 
   lw = strlen (what);
+
   for (i = 0; list[i]; i++)
     {
-      if (list[i][0] == '\0')
-        continue;
+      j = strlen (list[i]);
+      if (lw < j)
+        continue; /* what is no (sub)domain of list[i] */
 
-      for (j = strlen (list[i]), k = lw; j >= 0 && k >= 0; j--, k--)
+      for (k = lw; j >= 0 && k >= 0; j--, k--)
         if (c_tolower (list[i][j]) != c_tolower (what[k]))
           break;
-      /* The domain must be first to reach to beginning. */
-      if (j == -1)
+
+      /* Domain or subdomain match
+       * k == -1: exact match
+       * k >= 0 && what[k] == '.': subdomain match
+       */
+      if (j == -1 && (k == -1 || what[k] == '.'))
         return true;
     }
+
   return false;
 }
 
-- 
2.15.1
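
For readers who want to try the difference outside of Wget, here is a
minimal standalone sketch of the two behaviours discussed in this thread.
It is not the code from src/host.c: the function names suffix_only_match()
and domain_match() are made up for illustration, it checks a single domain
instead of a list, it uses tolower() from <ctype.h> where Wget uses
c_tolower(), and (unlike the patch) it simply rejects an empty domain entry.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: the pre-patch idea, a plain case-insensitive
   suffix comparison of HOST against DOMAIN. */
static bool
suffix_only_match (const char *domain, const char *host)
{
  size_t ld = strlen (domain), lh = strlen (host);
  size_t i;

  /* Simplification of this sketch: empty entries never match. */
  if (ld == 0 || lh < ld)
    return false;

  for (i = 0; i < ld; i++)
    if (tolower ((unsigned char) domain[i])
        != tolower ((unsigned char) host[lh - ld + i]))
      return false;

  return true;
}

/* Hypothetical helper: the post-patch idea. The suffix must either be
   the whole host (exact match) or be preceded by a '.' in HOST
   (subdomain match). */
static bool
domain_match (const char *domain, const char *host)
{
  size_t ld = strlen (domain), lh = strlen (host);

  if (!suffix_only_match (domain, host))
    return false;

  return lh == ld || host[lh - ld - 1] == '.';
}
```

With the hosts from the original report and "scapino.nl" as the accepted
domain, suffix_only_match() accepts werkenbijscapino.nl, which is the
behaviour reported above. domain_match() rejects it, while still accepting
scapino.nl, www.scapino.nl and cdn.scapino.nl.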
