Hi,

I responded to your original question on Stack Overflow. However, for
completeness and to document the facts, I'll add a response here too.

The answer to your question is: No. Sadly, Wget does NOT check for the
User-Agent string it is actually using when parsing the robots.txt
file. It simply reads the rules for `User-Agent: *` and `User-Agent:
wget`, giving preference to the rules specified for Wget alone.

This also has another major implication. Wget reads and adheres to
robots rules ONLY for `*` and `wget`. This means that not only does
Wget ignore the exclusion rules that target its custom User-Agent, it
even follows the wrong set of rules when it is run with a different
User-Agent and the website provides a set of rules specifically for
Wget.
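For comparison, Python's standard-library `urllib.robotparser` performs the per-agent matching one would expect here. A minimal sketch against the same rules as the attached test case (the host and paths are just placeholders for illustration):

```python
#!/usr/bin/env python3
# Reference behaviour: the rule block targets "wget" only, so a crawler
# running with a different User-Agent should NOT be restricted by it.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: wget
Disallow: /secondpage.html
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Wget itself is barred from the page...
print(rp.can_fetch("wget", "http://127.0.0.1/secondpage.html"))      # False
# ...but a custom agent such as "Test bot" is not.
print(rp.can_fetch("Test bot", "http://127.0.0.1/secondpage.html"))  # True
```

This is exactly the behaviour the test case below checks for: with `--user-agent='Test bot'`, Wget should crawl secondpage.html, but the buggy code applies the `wget` rules instead.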

This bug can be seen in action with the test case I created. Apply the
attached patch and run the Test--UA.py test. The patch is made against
the new Python-based test suite that exists in the parallel-wget
branch.

On Fri, Jun 20, 2014 at 2:47 AM, György Chityil
<[email protected]> wrote:
> If I specify a custom user agent for wget, eg "MyBot 1.0 (info@mybot...)"
> Will wget check this in robots.txt as well, if the bot was banned, or only
> the general robot exclusions? Does wget check if "MyBot" is allowed to
> crawl?
> If not, this would be a nice feature.  If yes, it would be great to include
> this info in the robots overview here https://www.gnu.org/software/wget
>
> I originally posted this question here, but then I found this list
> http://stackoverflow.com/questions/24316018/does-wget-check-if-specified-user-agent-is-allowed-in-robots-txt
>
> --
> Gyuri
> 274 44 98
> 06 30 5888 744



-- 
Thanking You,
Darshit Shah
From be0a0ba616eb7a413c92e25c8d7e86d8633972bc Mon Sep 17 00:00:00 2001
From: Darshit Shah <[email protected]>
Date: Sun, 22 Jun 2014 00:58:53 +0530
Subject: [PATCH] Test case showing User agent bug

---
 testenv/Test--UA.py | 112 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 112 insertions(+)
 create mode 100755 testenv/Test--UA.py

diff --git a/testenv/Test--UA.py b/testenv/Test--UA.py
new file mode 100755
index 0000000..5f87f7e
--- /dev/null
+++ b/testenv/Test--UA.py
@@ -0,0 +1,112 @@
+#!/usr/bin/env python3
+from sys import exit
+from test.http_test import HTTPTest
+from misc.wget_file import WgetFile
+
+"""
+    This test executed Wget in Spider mode with recursive retrieval.
+"""
+TEST_NAME = "Recursive Spider"
+############# File Definitions ###############################################
+mainpage = """
+<html>
+<head>
+  <title>Main Page</title>
+</head>
+<body>
+  <p>
+    Some text and a link to a <a href="http://127.0.0.1:{{port}}/secondpage.html">second page</a>.
+    Also, a <a href="http://127.0.0.1:{{port}}/nonexistent">broken link</a>.
+  </p>
+</body>
+</html>
+"""
+
+robots = """
+# robots.txt generated at http://www.mcanerin.com
+User-agent: wget
+Disallow: /secondpage.html
+"""
+
+secondpage = """
+<html>
+<head>
+  <title>Second Page</title>
+</head>
+<body>
+  <p>
+    Some text and a link to a <a href="http://127.0.0.1:{{port}}/thirdpage.html">third page</a>.
+    Also, a <a href="http://127.0.0.1:{{port}}/nonexistent">broken link</a>.
+  </p>
+</body>
+</html>
+"""
+
+thirdpage = """
+<html>
+<head>
+  <title>Third Page</title>
+</head>
+<body>
+  <p>
+    Some text and a link to a <a href="http://127.0.0.1:{{port}}/dummy.txt">text file</a>.
+    Also, another <a href="http://127.0.0.1:{{port}}/againnonexistent">broken link</a>.
+  </p>
+</body>
+</html>
+"""
+
+dummyfile = "Don't care."
+
+
+index_html = WgetFile ("index.html", mainpage)
+secondpage_html = WgetFile ("secondpage.html", secondpage)
+thirdpage_html = WgetFile ("thirdpage.html", thirdpage)
+dummy_txt = WgetFile ("dummy.txt", dummyfile)
+robots_txt = WgetFile ("robots.txt", robots)
+
+Request_List = [
+    [
+        "HEAD /",
+        "GET /",
+        "GET /robots.txt",
+        "HEAD /secondpage.html",
+        "GET /secondpage.html",
+        "HEAD /nonexistent",
+        "HEAD /thirdpage.html",
+        "GET /thirdpage.html",
+        "HEAD /dummy.txt",
+        "HEAD /againnonexistent"
+    ]
+]
+
+WGET_OPTIONS = "-d --spider --user-agent='Test bot' -r"
+WGET_URLS = [[""]]
+
+Files = [[index_html, secondpage_html, thirdpage_html, dummy_txt, robots_txt]]
+
+ExpectedReturnCode = 8
+ExpectedDownloadedFiles = []
+
+################ Pre and Post Test Hooks #####################################
+pre_test = {
+    "ServerFiles"       : Files
+}
+test_options = {
+    "WgetCommands"      : WGET_OPTIONS,
+    "Urls"              : WGET_URLS
+}
+post_test = {
+    "ExpectedFiles"     : ExpectedDownloadedFiles,
+    "ExpectedRetcode"   : ExpectedReturnCode,
+    "FilesCrawled"      : Request_List
+}
+
+err = HTTPTest (
+                name=TEST_NAME,
+                pre_hook=pre_test,
+                test_params=test_options,
+                post_hook=post_test
+).begin ()
+
+exit (err)
-- 
2.0.0
