Hi,

This is not a PSL (Public Suffix List) matching problem, so libpsl is not needed.

Only sufmatch() has to be fixed so that it does (sub)domain matching instead of
plain suffix matching.

Attached is a fix.
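
For illustration only, here is a standalone sketch of the (sub)domain rule.
It is not the code from the patch: the name domain_matches() and its
two-string signature are hypothetical, and unlike the patched sufmatch() it
treats an empty entry as matching nothing.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper, not part of Wget: returns true when HOST equals
   DOMAIN or is a subdomain of it.  The comparison is case-insensitive.  */
bool
domain_matches (const char *domain, const char *host)
{
  assert (domain != NULL && host != NULL);

  size_t dlen = strlen (domain);
  size_t hlen = strlen (host);

  /* Assumption of this sketch: an empty entry matches nothing.
     A host shorter than the entry cannot be a (sub)domain of it.  */
  if (dlen == 0 || hlen < dlen)
    return false;

  /* Compare the last DLEN characters of HOST with DOMAIN.  */
  const char *tail = host + (hlen - dlen);
  for (size_t i = 0; i < dlen; i++)
    if (tolower ((unsigned char) tail[i])
        != tolower ((unsigned char) domain[i]))
      return false;

  /* Either the strings are equal, or the matched suffix must begin
     right after a '.' so that it covers whole labels only.  */
  return hlen == dlen || tail[-1] == '.';
}
```

With "scapino.nl" as the accepted domain, this accepts www.scapino.nl and
cdn.scapino.nl but rejects werkenbijscapino.nl, because the character in
front of the matched suffix is 'j' rather than '.'.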


With Best Regards, Tim



On 01/17/2018 03:01 PM, Darshit Shah wrote:
> Hi,
> 
> This is a bug in Wget, apparently a really old one! It seems the bug has been
> around since at least 1997.
> 
> Looking at the source, the issue is that Wget does a very simple string-suffix
> match of the actual domain against the accepted domains list. This is wrong,
> as you have just found out: werkenbijscapino.nl ends in scapino.nl.
> 
> I'm going to try to implement this correctly, but I'm currently a little short
> on time, so if anyone else wants to pick it up, please feel free to. It's
> simple: use libpsl to get the proper domain name and match against that.
> 
> 
> Of course, this change will require libpsl to no longer be an optional
> dependency.
> 
> * Friso van Vollenhoven <[email protected]> [180117 14:40]:
>> Hello all,
>>
>> I am trying to do a recursive download of a webpage and span multiple hosts
>> within the same domain, but not cross to other domains. The issue is that
>> the crawl does extend to other domains. My full command is this:
>>
>> wget \
>> --recursive \
>> --no-clobber \
>> --page-requisites \
>> --adjust-extension \
>> --span-hosts \
>> --domains=scapino.nl \
>> --no-parent \
>> --tries=2 \
>> --wait=1 \
>> --random-wait \
>> --waitretry=2 \
>> --header='User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2)
>> AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36' \
>> https://www.scapino.nl/winkels/scapino-utrecht-510061
>>
>> From this combination of --span-hosts and --domains, I would expect to
>> download assets from cdn.scapino.nl and www.scapino.nl, but not other
>> domains. For some reason that I don't understand, wget also starts to do
>> what looks like a full crawl of the domain werkenbijscapino.nl, which is
>> referenced from the original page.
>>
>> Any thoughts or direction would be much appreciated.
>>
>> I am using wget 1.18 on Debian.
>>
>>
>> Best regards,
>> Friso
> 
From 1ad636baa63cfe029c84235986626c17e4ff33cb Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Tim=20R=C3=BChsen?= <[email protected]>
Date: Wed, 17 Jan 2018 15:50:48 +0100
Subject: [PATCH] * src/host.c (sufmatch): Fix to domain matching

---
 src/host.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/src/host.c b/src/host.c
index 2ddae328..d337cc7c 100644
--- a/src/host.c
+++ b/src/host.c
@@ -1017,18 +1017,25 @@ sufmatch (const char **list, const char *what)
   int i, j, k, lw;
 
   lw = strlen (what);
+
   for (i = 0; list[i]; i++)
     {
-      if (list[i][0] == '\0')
-        continue;
+      j = strlen (list[i]);
+      if (lw < j)
+        continue; /* what is no (sub)domain of list[i] */
 
-      for (j = strlen (list[i]), k = lw; j >= 0 && k >= 0; j--, k--)
+      for (k = lw; j >= 0 && k >= 0; j--, k--)
         if (c_tolower (list[i][j]) != c_tolower (what[k]))
           break;
-      /* The domain must be first to reach to beginning.  */
-      if (j == -1)
+
+      /* Domain or subdomain match
+       * k == -1: exact match
+       * k >= 0 && what[k] == '.': subdomain match
+       */
+      if (j == -1 && (k == -1 || what[k] == '.'))
         return true;
     }
+
   return false;
 }
 
-- 
2.15.1
