On 11/3/2019 6:26 PM, Gene Heskett wrote:
> On Sunday 03 November 2019 11:56:52 Reco wrote:
>
>> On Sun, Nov 03, 2019 at 10:48:58AM -0500, Gene Heskett wrote:
>>> On Sunday 03 November 2019 10:23:50 Reco wrote:
>>>> On Sun, Nov 03, 2019 at 10:04:46AM -0500, Gene Heskett wrote:
>>>>> Greetings all
>>>>>
>>>>> I am developing a list of broken webcrawlers who are repeatedly
>>>>> downloading my entire web site including the hidden stuff.
>>>>>
>>>>> These crawlers/bots are ignoring my robots.txt
>>>>
>>>> $ wget -O - https://www.shentel.com/robots.txt
>>>> --2019-11-03 15:22:35--  https://www.shentel.com/robots.txt
>>>> Resolving www.shentel.com (www.shentel.com)... 45.60.160.21
>>>> Connecting to www.shentel.com (www.shentel.com)|45.60.160.21|:443... connected.
>>>> HTTP request sent, awaiting response... 403 Forbidden
>>>> 2019-11-03 15:22:36 ERROR 403: Forbidden.
>>>>
>>>> Allowing said bots to *see* your robots.txt would be a step in
>>>> the right direction.
>>>
>>> But you are asking for shentel.com/robots.txt, which is my ISP.
>>> You should be asking for
>>>
>>> http://geneslinuxbox.net:6309/gene/robots.txt
>>
>> Wow. You sir owe me a new set of eyes.
>
> Chuckle :) That was the default I'd picked up from someplace years ago.
>
>> I advise you to compare your monstrosity to this (a hint - it does
>> work) - [1].
>>
>> Reco
>>
>> [1] https://enotuniq.net/robots.txt
>
> I'll trim mine forthwith to the last entry. I've wondered if that was
> too long a list. And restart apache2, of course. But now I see the next
> access is not a 200 but a 404; that was not intended. From the access log:
>
> coyote.coyote.den:80 209.197.24.34 - - [03/Nov/2019:12:19:55 -0500]
> "GET /gene/lathe-stf/linuxcnc4rpi4 HTTP/1.1" 404 498 "-"
> "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
>
> That directory exists; shouldn't that have been a 200?
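Worth noting for anyone following along: crawlers only ever request /robots.txt at the root of a host, so a file living at /gene/robots.txt will never be consulted. A minimal sketch of the kind of short file Reco is pointing at (the Disallow path below is a placeholder, not Gene's actual layout):

```
# Must be reachable at the site root, e.g. http://geneslinuxbox.net:6309/robots.txt
User-agent: *
Disallow: /hidden-stuff/
```

Of course, robots.txt is purely advisory; badly-behaved bots that already ignore it will keep crawling, and only Apache-side blocking actually stops them.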
The directory might exist on disk and still not be accessible: the URL has to map onto it via your DocumentRoot or an Alias, and the user Apache runs as has to be able to traverse every directory on the path. A plain 404 (rather than a 403) usually means the mapping itself failed. -- John Doe
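One quick way to check the traversal half of that is to walk up from the leaf and look at the permissions on every component. This is only a sketch: /var/www/gene/lathe-stf/linuxcnc4rpi4 is a guess at the on-disk location, so substitute your real DocumentRoot plus the requested path.

```shell
# Walk each parent of the path and show its permissions; every component
# needs execute (x) for the Apache user for the leaf to be served.
path=/var/www/gene/lathe-stf/linuxcnc4rpi4   # hypothetical path -- adjust
while [ "$path" != "/" ]; do
    ls -ld "$path" 2>/dev/null || echo "missing or unreadable: $path"
    path=$(dirname "$path")
done
```

The `namei -l /some/path` utility from util-linux prints the same walk in one shot. On Debian the Apache user is typically www-data; check the User directive in your apache2 configuration to be sure.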