Hi, I responded to your original question on Stack Overflow. However, for completeness and to document the facts, I'll add a response here too.
The answer to your question is: no. Sadly enough, Wget does NOT check the robots.txt rules against the user-agent string it is actually using. It simply reads the rules for `User-Agent: *` and `User-Agent: wget`, giving preference to the rules specified for Wget alone.

This has another major implication. Since Wget reads and adheres to robots rules ONLY for `*` and `wget`, a Wget run with a custom user agent not only ignores the exclusion rules that correctly apply to that agent, it even follows the wrong set of rules whenever the website provides rules specifically for Wget.

This bug can be seen in action through the test case I created. Apply the attached patch and run the Test--UA.py test. The patch is made against the new Python-based test suite, which lives in the parallel-wget branch.

On Fri, Jun 20, 2014 at 2:47 AM, György Chityil <[email protected]> wrote:
> If I specify a custom user agent for wget, e.g. "MyBot 1.0 (info@mybot...)",
> will wget check this in robots.txt as well if the bot was banned, or only
> the general robot exclusions? Does wget check whether "MyBot" is allowed to
> crawl? If not, this would be a nice feature. If yes, it would be great to
> include this info in the robots overview here:
> https://www.gnu.org/software/wget
>
> I originally posted this question here, but then I found this list:
> http://stackoverflow.com/questions/24316018/does-wget-check-if-specified-user-agent-is-allowed-in-robots-txt
>
> --
> Gyuri
> 274 44 98
> 06 30 5888 744

--
Thanking You,
Darshit Shah
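For comparison, here is a minimal sketch (not Wget code) of what user-agent-aware matching looks like, using Python's standard-library urllib.robotparser; the robots.txt content and the "Test bot" agent string are just illustrative. A group targeting only "wget" should not apply to a crawler identifying as "Test bot", yet Wget with --user-agent='Test bot' would still honor the "wget" group:

```python
# Illustration only: correct user-agent matching against robots.txt,
# using Python's stdlib urllib.robotparser (NOT Wget's parser).
# The rules below target only the "wget" agent, so a crawler that
# identifies itself as "Test bot" should be allowed to fetch the page.
from urllib import robotparser

ROBOTS_TXT = """\
User-agent: wget
Disallow: /secondpage.html
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A compliant crawler matches rules against its *effective* user agent:
print(rp.can_fetch("Test bot", "http://127.0.0.1/secondpage.html"))  # True: no rule targets "Test bot"
print(rp.can_fetch("wget", "http://127.0.0.1/secondpage.html"))      # False: the "wget" group disallows it
```

Wget, by contrast, always behaves like the second call regardless of the user agent it sends, which is exactly what the attached test demonstrates.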
From be0a0ba616eb7a413c92e25c8d7e86d8633972bc Mon Sep 17 00:00:00 2001
From: Darshit Shah <[email protected]>
Date: Sun, 22 Jun 2014 00:58:53 +0530
Subject: [PATCH] Test case showing User agent bug

---
 testenv/Test--UA.py | 112 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 112 insertions(+)
 create mode 100755 testenv/Test--UA.py

diff --git a/testenv/Test--UA.py b/testenv/Test--UA.py
new file mode 100755
index 0000000..5f87f7e
--- /dev/null
+++ b/testenv/Test--UA.py
@@ -0,0 +1,112 @@
+#!/usr/bin/env python3
+from sys import exit
+from test.http_test import HTTPTest
+from misc.wget_file import WgetFile
+
+"""
+    This test executes Wget in Spider mode with recursive retrieval.
+"""
+TEST_NAME = "Recursive Spider"
+############# File Definitions ###############################################
+mainpage = """
+<html>
+<head>
+    <title>Main Page</title>
+</head>
+<body>
+    <p>
+        Some text and a link to a <a href="http://127.0.0.1:{{port}}/secondpage.html">second page</a>.
+        Also, a <a href="http://127.0.0.1:{{port}}/nonexistent">broken link</a>.
+    </p>
+</body>
+</html>
+"""
+
+robots = """
+# robots.txt generated at http://www.mcanerin.com
+User-agent: wget
+Disallow: secondpage.html
+"""
+
+secondpage = """
+<html>
+<head>
+    <title>Second Page</title>
+</head>
+<body>
+    <p>
+        Some text and a link to a <a href="http://127.0.0.1:{{port}}/thirdpage.html">third page</a>.
+        Also, a <a href="http://127.0.0.1:{{port}}/nonexistent">broken link</a>.
+    </p>
+</body>
+</html>
+"""
+
+thirdpage = """
+<html>
+<head>
+    <title>Third Page</title>
+</head>
+<body>
+    <p>
+        Some text and a link to a <a href="http://127.0.0.1:{{port}}/dummy.txt">text file</a>.
+        Also, another <a href="http://127.0.0.1:{{port}}/againnonexistent">broken link</a>.
+    </p>
+</body>
+</html>
+"""
+
+dummyfile = "Don't care."
+
+
+index_html = WgetFile ("index.html", mainpage)
+secondpage_html = WgetFile ("secondpage.html", secondpage)
+thirdpage_html = WgetFile ("thirdpage.html", thirdpage)
+dummy_txt = WgetFile ("dummy.txt", dummyfile)
+robots_txt = WgetFile ("robots.txt", robots)
+
+Request_List = [
+    [
+        "HEAD /",
+        "GET /",
+        "GET /robots.txt",
+        "HEAD /secondpage.html",
+        "GET /secondpage.html",
+        "HEAD /nonexistent",
+        "HEAD /thirdpage.html",
+        "GET /thirdpage.html",
+        "HEAD /dummy.txt",
+        "HEAD /againnonexistent"
+    ]
+]
+
+WGET_OPTIONS = "-d --spider --user-agent='Test bot' -r"
+WGET_URLS = [[""]]
+
+Files = [[index_html, secondpage_html, thirdpage_html, dummy_txt, robots_txt]]
+
+ExpectedReturnCode = 8
+ExpectedDownloadedFiles = []
+
+################ Pre and Post Test Hooks #####################################
+pre_test = {
+    "ServerFiles"       : Files
+}
+test_options = {
+    "WgetCommands"      : WGET_OPTIONS,
+    "Urls"              : WGET_URLS
+}
+post_test = {
+    "ExpectedFiles"     : ExpectedDownloadedFiles,
+    "ExpectedRetcode"   : ExpectedReturnCode,
+    "FilesCrawled"      : Request_List
+}
+
+err = HTTPTest (
+                name=TEST_NAME,
+                pre_hook=pre_test,
+                test_params=test_options,
+                post_hook=post_test
+).begin ()
+
+exit (err)
-- 
2.0.0
