[cctalk] Re: Large language model (LLM) Web Scrapers

Ethan O'Toole via cctalk Wed, 17 Sep 2025 09:12:10 -0700

A web crawler that does not obey robots.txt is not a law abiding outfit. Best would be to block it entirely. If they are that dismissive of honesty, they are also unlikely to pay attention to such matters as copyright and intellectual property ownership.
        paul

A forum related to laser show systems I use from time to time was getting hit by scrapers coming from 300,000 unique IP addresses.


I hit a different issue: they were completing the CAPTCHA and spamming the wiki.

     - Ethan

On Sep 16, 2025, at 8:55 PM, Wayne S via cctalk <[email protected]> wrote:

They do not observe robots.txt.

On Sep 16, 2025, at 17:53, Wayne S <[email protected]> wrote:

I did notice the scraping.
I toyed with the idea of putting ludicrous text files up that a normal user 
would not see and see which bot got them.
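
A rough sketch of how the decoy-file idea could be checked against a web server access log. It assumes the server writes the common "combined" log layout (as Apache and nginx can), and the decoy path `/decoy/ludicrous.txt` and the function name `honeypot_hits` are made up for illustration, not anything from an actual site.

```python
import re
from collections import Counter

# Hypothetical decoy path: a file no human visitor would ever be shown a link to.
HONEYPOT_PATH = "/decoy/ludicrous.txt"

# Assumed "combined" access log layout:
# ip ident user [time] "METHOD path PROTO" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def honeypot_hits(lines, path=HONEYPOT_PATH):
    """Count requests for the decoy path, grouped by (IP, user agent)."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not follow the assumed layout
        # Ignore any query string when comparing against the decoy path.
        requested = match.group("path").split("?", 1)[0]
        if requested == path:
            hits[(match.group("ip"), match.group("agent"))] += 1
    return hits
```

Feeding it the lines of an access log (for example `honeypot_hits(open("access.log"))`, with the log file name again an assumption) gives a tally of which addresses and self-reported user agents fetched the decoy. User-agent strings are easy to fake, so the result is a hint rather than proof of which bot it was.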

On Sep 16, 2025, at 17:02, Bill Degnan via cctalk <[email protected]> wrote:

For those of you who run vintage computing-related info sites, have you
noticed all of the LLM scraper activity? AI services are using the LLM
scrapers to populate their knowledge bases.

At any given moment 5-10 of them are active on vintagecomputer.net. It’s
funny, when I ask an AI about something vintage computing-related,
something obscure, I can trick it into giving me an answer from my own site.

I have actually had to modify the site code to manage the traffic, to
improve efficiency.

But they’re not going after just my site, these scrapers are absorbing
copies of the entire WWW.

I wonder how long the WWW will remain open. It would be a bummer if I found
copies of my site elsewhere.

Bill


--
: Ethan O'Toole
