The way I look at it, for my personal site and content:

1. I'm not going to win the arms race. It's not my area of expertise. I
can't put hours into devising anti-bot countermeasures.

2. If I do try to implement countermeasures to prevent bots, I will likely
also end up impacting/inconveniencing some legitimate users.

3. Ultimately my goal is to help people, so if my content ends up training
an AI model, and that model ends up helping people, then I'm indirectly
meeting my goal.

4. Many of the AIs are now citing their sources, which means I get some
level of attribution and recognition.

5. Some archives, such as the Wayback Machine, I find extremely useful for
vintage computer research. People die. Providers shut down. A lot of
knowledge has been lost. I'll be happy if my content eventually outlives
me.

I wish there were more focus on (4). Everyone deserves recognition of
their work and their content. I'd support legislation requiring that
sources be cited/acknowledged when AI results are returned.

I think there's some risk of "content laundering": a bot is trained on
your content, someone publishes an AI-generated article, and the next bot
is trained on that AI-generated content, losing the original attribution.
Without discipline, it can turn into a bunch of slop whose origin and
accuracy nobody can verify.

Scott

On Wed, Sep 17, 2025 at 11:31 AM Bill Degnan via cctalk <
[email protected]> wrote:

> On Wed, Sep 17, 2025 at 1:27 PM The Doctor via cctalk <
> [email protected]> wrote:
>
> > On Tuesday, September 16th, 2025 at 17:01, Bill Degnan via cctalk <
> > [email protected]> wrote:
> >
> > > I wonder how long the WWW will remain open, it would be a bummer if I
> > > found copies of my site elsewhere.
> >
> > I've been thinking about this myself. It does not please me.
> >
> > What web server do you use for your site? I've got some pretty robust
> > but easy to admin countermeasures set up on my own website that I'd be
> > happy to share if there is interest.
> >
>
> I run a web services company; vintagecomputer.net is internally supported.
> vintagecomputer.net has been dealing with some sort of scrapers for 20
> years. The site is privately hosted and has web scraping control measures
> built to detect a whole array of bot activity. Rather than block, I
> believe it's better to detect and log, and then determine how best to
> manage new types of bot probing and scraping on an ongoing basis. It's a
> great way to learn white-hat hacking.
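The "detect and log rather than block" approach described above can be
sketched as a small request classifier. This is a hypothetical
illustration, not the actual measures on vintagecomputer.net: the
signature list, function name, and log format here are all assumptions.

```python
import logging
import re

# Hypothetical signature list -- a real deployment would curate a far
# larger set and update it as new bot probing patterns are observed.
BOT_SIGNATURES = [
    re.compile(pat, re.IGNORECASE)
    for pat in (r"GPTBot", r"CCBot", r"Bytespider", r"curl/", r"python-requests")
]

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("botwatch")


def classify_request(remote_addr: str, user_agent: str, path: str) -> bool:
    """Return True if the request looks like a bot.

    Matching requests are logged for later review, but the caller still
    serves the response normally -- detect and log, don't block.
    """
    is_bot = any(sig.search(user_agent or "") for sig in BOT_SIGNATURES)
    if is_bot:
        log.info("bot-detected ip=%s ua=%r path=%s", remote_addr, user_agent, path)
    return is_bot
```

The point of returning the classification instead of raising an error is
that the server's behavior is unchanged; the operator reviews the log
periodically to decide how to handle each new scraper.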
