The wordfreq data is a snapshot of language that could be found in
various online sources up through 2021. There are several reasons why
it will not be updated anymore. 

## Generative AI has polluted the data

I don't think anyone has reliable information about post-2021 language
usage by humans.

The open Web (via OSCAR) was one of wordfreq's data sources. Now the
Web at large is full of slop generated by large language models,
written by no one to communicate nothing. Including this slop in the
data skews the word frequencies. [...]

As one example,
[Philip Shapira reports](https://pshapira.net/2024/03/31/delving-into-delve/)
that ChatGPT (OpenAI's popular brand of generative language model circa
2024) is obsessed with the word "delve" in a way that people never have
been, and has caused its overall frequency to increase by an order of
magnitude.

## Information that used to be free became expensive

[...] Even if X made its raw data feed available (which it doesn't),
there would be no valuable information to be found there.

Reddit also stopped providing public data archives, and now they sell
their archives at a price that only OpenAI will pay.

## I don't want to be part of this scene anymore

wordfreq used to be at the intersection of my interests. I was doing
corpus linguistics in a way that could also benefit natural language
processing tools.

The field I know as "natural language processing" is hard to find these
days. It's all being devoured by generative AI. Other techniques still
exist but generative AI sucks up all the air in the room and gets all
the money. It's rare to see NLP research that doesn't have a dependency
on closed data controlled by OpenAI and Google, two companies that I
already despise.

wordfreq was built by collecting a whole lot of text in a lot of
languages. That used to be a pretty reasonable thing to do, and not the
kind of thing someone would be likely to object to. Now, the
text-slurping tools are mostly used for training generative AI, and
people are quite rightly on the defensive. If someone is collecting all
the text from your books, articles, Web site, or public posts, it's
very likely because they are creating a plagiarism machine that will
claim your words as its own.

So I don't want to work on anything that could be confused with
generative AI, or that could benefit generative AI.


https://github.com/rspeer/wordfreq/blob/master/SUNSET.md



[1] https://pshapira.net/2024/03/31/delving-into-delve/
