On 11/3/2019 6:26 PM, Gene Heskett wrote:
> On Sunday 03 November 2019 11:56:52 Reco wrote:
>
>> On Sun, Nov 03, 2019 at 10:48:58AM -0500, Gene Heskett wrote:
>>> On Sunday 03 November 2019 10:23:50 Reco wrote:
>>>> On Sun, Nov 03, 2019 at 10:04:46AM -0500, Gene Heskett wrote:
>>>>> Greetings all
>>>>>
>>>>> I am developing a list of broken webcrawlers who are repeatedly
>>>>> downloading my entire web site including the hidden stuff.
>>>>>
>>>>> These crawlers/bots are ignoring my robots.txt
>>>>
>>>> $ wget -O - https://www.shentel.com/robots.txt
>>>> --2019-11-03 15:22:35--  https://www.shentel.com/robots.txt
>>>> Resolving www.shentel.com (www.shentel.com)... 45.60.160.21
>>>> Connecting to www.shentel.com (www.shentel.com)|45.60.160.21|:443... connected.
>>>> HTTP request sent, awaiting response... 403 Forbidden
>>>> 2019-11-03 15:22:36 ERROR 403: Forbidden.
>>>>
>>>> Allowing said bots to *see* your robots.txt would be a step in
>>>> the right direction.
>>>
>>> But you are asking for shentel.com/robots.txt, which is my ISP.
>>> You should be asking for
>>>
>>> http://geneslinuxbox.net:6309/gene/robots.txt
>>
>> Wow. You sir owe me a new set of eyes.
>
> Chuckle :) That was the default I'd picked up from someplace years ago.
>
>> I advise you to compare your monstrosity to this (a hint - it does
>> work) - [1].
>>
>> Reco
>>
>> [1] https://enotuniq.net/robots.txt
>
> I'll trim mine forthwith to the last entry. I've wondered if that was
> too long a list. And restart apache2, of course. But now I see the next
> access is not a 200 but a 404; that was not intended. From the access log:
>
> coyote.coyote.den:80 209.197.24.34 - - [03/Nov/2019:12:19:55 -0500]
> "GET /gene/lathe-stf/linuxcnc4rpi4 HTTP/1.1" 404 498 "-"
> "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
>
> That directory exists; shouldn't that have been a 200?
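Worth noting for anyone following along: crawlers only ever request /robots.txt at the root of a host, so a file living at /gene/robots.txt will never be consulted. A minimal sketch of the kind of short file Reco is pointing at (the Disallow path below is a placeholder, not Gene's actual layout):

```
# Must be reachable at the site root, e.g. http://geneslinuxbox.net:6309/robots.txt
User-agent: *
Disallow: /hidden-stuff/
```

Of course, robots.txt is purely advisory; badly-behaved bots that already ignore it will keep crawling, and only Apache-side blocking actually stops them.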
The directory might exist on disk and still not be accessible: the URL has to map onto it via your DocumentRoot or an Alias, and the user Apache runs as has to be able to traverse every directory on the path. A plain 404 (rather than a 403) usually means the mapping itself failed. -- John Doe
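One quick way to check the traversal half of that is to walk up from the leaf and look at the permissions on every component. This is only a sketch: /var/www/gene/lathe-stf/linuxcnc4rpi4 is a guess at the on-disk location, so substitute your real DocumentRoot plus the requested path.

```shell
# Walk each parent of the path and show its permissions; every component
# needs execute (x) for the Apache user for the leaf to be served.
path=/var/www/gene/lathe-stf/linuxcnc4rpi4   # hypothetical path -- adjust
while [ "$path" != "/" ]; do
    ls -ld "$path" 2>/dev/null || echo "missing or unreadable: $path"
    path=$(dirname "$path")
done
```

The `namei -l /some/path` utility from util-linux prints the same walk in one shot. On Debian the Apache user is typically www-data; check the User directive in your apache2 configuration to be sure.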