<https://www.washingtonpost.com/technology/2024/01/04/nyt-ai-copyright-lawsuit-fair-use/>

If a media outlet copied a bunch of New York Times stories and posted them on 
its site, that would probably be seen as a blatant violation of the Times’s 
copyright.

But what about when a tech company copies those same articles, combines them 
with countless other copied works, and uses them to train an AI chatbot capable 
of conversing on almost any topic — including the ones it learned about from 
the Times?
That’s the legal question at the heart of a lawsuit the Times filed against 
OpenAI and Microsoft in federal court last week, alleging that the tech firms 
illegally used “millions” of copyrighted Times articles to help develop the AI 
models behind tools such as ChatGPT and Bing. It’s the latest, and some believe 
the strongest, in a bevy of active lawsuits alleging that various tech and 
artificial intelligence companies have violated the intellectual property of 
media companies, photography sites, book authors and artists.

Together, the cases have the potential to rattle the foundations of the booming 
generative AI industry, some legal experts say — but they could also fall flat. 
That’s because the tech firms are likely to lean heavily on a legal concept 
that has served them well in the past: the doctrine known as “fair use.”

Broadly speaking, copyright law distinguishes between ripping off someone 
else’s work verbatim — which is generally illegal — and “remixing” or putting 
it to a new, creative use. What is confounding about AI systems, said James 
Grimmelmann, a professor of digital and information law at Cornell University, 
is that in this case they seem to be doing both.

Generative AI represents “this big technological transformation that can make a 
remixed version of anything,” Grimmelmann said. “The challenge is that these 
models can also blatantly memorize works they were trained on, and often 
produce near-exact copies,” which, he said, is “traditionally the heart of what 
copyright law prohibits.”

From the first VCRs, which could be used to record TV shows and movies, to 
Google Books, which digitized millions of books, U.S. companies have convinced 
courts that their technological tools amounted to fair use of copyrighted 
works. OpenAI and Microsoft are already mounting a similar defense.

“We believe that the training of AI models qualifies as a fair use, falling 
squarely in line with established precedents recognizing that the use of 
copyrighted materials by technology innovators in transformative ways is 
entirely consistent with copyright law,” OpenAI wrote in a filing to the U.S. 
Copyright Office in November.

AI systems are typically “trained” on gargantuan data sets that include vast 
amounts of published material, much of it copyrighted. Through this training, 
they come to recognize patterns in the arrangement of words and pixels, which 
they can then draw on to assemble plausible prose and images in response to 
just about any prompt.
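At its simplest, that pattern-learning process can be illustrated with a toy model. The sketch below is a hypothetical, minimal illustration — real systems like GPT-4 use neural networks vastly more sophisticated than this — but it shows the basic idea of tallying which words follow which in a corpus, then drawing on those statistics to assemble plausible new text:

```python
# A minimal, illustrative sketch of statistical language modeling
# (NOT how GPT-4 actually works): tally which word follows which,
# then sample from those learned continuations to generate text.
import random
from collections import defaultdict

def train_bigrams(corpus):
    """Count, for each word, the words observed to follow it."""
    model = defaultdict(list)
    words = corpus.split()
    for prev, nxt in zip(words, words[1:]):
        model[prev].append(nxt)
    return model

def generate(model, start, length=8, seed=0):
    """Assemble a plausible word sequence by sampling continuations."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        continuations = model.get(out[-1])
        if not continuations:
            break  # no observed continuation for this word
        out.append(rng.choice(continuations))
    return " ".join(out)

corpus = "the times sued the firms and the firms mounted a fair use defense"
model = train_bigrams(corpus)
print(generate(model, "the"))
```

Note that the model stores literal copies of its training words — which hints at the dispute: is the system "learning" patterns, or reproducing copied material?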

Some AI enthusiasts view this process as a form of learning, not unlike an art 
student devouring books on Monet or a news junkie reading the Times 
cover-to-cover to develop their own expertise. But plaintiffs see a more 
mundane process at work under these models' hood: It's a form of copying, 
and unauthorized copying at that.

“It’s not learning the facts like a brain would learn facts,” said Danielle 
Coffey, chief executive of the News/Media Alliance, a trade group that 
represents more than 2,000 media organizations, including the Times and The 
Washington Post. “It’s literally spitting the words back out at you.”

There are two main prongs to the New York Times’s case against OpenAI and 
Microsoft. First, like other recent AI copyright lawsuits, the Times argues 
that its rights were infringed when its articles were “scraped” — or digitally 
scanned and copied — for inclusion in the giant data sets that GPT-4 and other 
AI models were trained on. That’s sometimes called the “input” side.

Second, the Times’s lawsuit cites examples in which OpenAI’s GPT-4 language 
model — versions of which power both ChatGPT and Bing — appeared to cough up 
either detailed summaries of paywalled articles, like the company’s Wirecutter 
product reviews, or entire sections of specific Times articles. In other words, 
the Times alleges, the tools violated its copyright with their “output,” too.

Judges so far have been wary of the argument that training an AI model on 
copyrighted works — the “input” side — amounts to a violation in itself, said 
Jason Bloom, a partner at the law firm Haynes and Boone and the chairman of its 
intellectual property litigation group.

“Technically, doing that can be copyright infringement, but it’s more likely to 
be considered fair use, based on precedent, because you’re not publicly 
displaying the work when you’re just ingesting and training” with it, Bloom 
said. (Bloom is not involved in any of the active AI copyright suits.)

Fair use also can apply when the copying is done for a purpose different from 
simply reproducing the original work — such as to critique it or to use it for 
research or educational purposes, like a teacher photocopying a news article to 
hand out to a journalism class. That’s how Google defended Google Books, an 
ambitious project to scan and digitize millions of copyrighted books from 
public and academic libraries so that it could make their contents searchable 
online.

The project sparked a 2005 lawsuit by the Authors Guild, which called it a 
“brazen violation of copyright law.” But Google argued that because it 
displayed only “snippets” of the books in response to searches, it wasn’t 
undermining the market for books but providing a fundamentally different 
service. In 2015, a federal appellate court agreed with Google.

That precedent should work in favor of OpenAI, Microsoft and other tech firms, 
said Eric Goldman, a professor at Santa Clara University School of Law and 
co-director of its High Tech Law Institute.

“I’m going to take the position, based on precedent, that if the outputs aren’t 
infringing, then anything that took place before isn’t infringing as well,” 
Goldman said. “Show me that the output is infringing. If it’s not, then 
copyright case over.”

OpenAI and Microsoft are also the subject of other AI copyright lawsuits, as 
are rival AI firms including Meta, Stability AI and Midjourney, with some 
targeting text-based chatbots and others targeting image generators. So far, 
judges have dismissed parts of at least two cases in which the plaintiffs 
failed to demonstrate that the AI’s outputs were substantially similar to their 
copyrighted works.

In contrast, the Times’s suit provides numerous examples in which a version of 
GPT-4 reproduced large passages of text identical to that in Times articles in 
response to certain prompts.

That could go a long way with a jury, should the case get that far, said Blake 
Reid, associate professor at Colorado Law. But if courts find that only those 
specific outputs are infringing, and not the use of the copyrighted material 
for training, he added, that could prove much easier for the tech firms to fix.

OpenAI’s position is that the examples in the Times’s lawsuit are aberrations — 
a sort of bug in the system that caused it to cough up passages verbatim.

Tom Rubin, OpenAI’s chief of intellectual property and content, said the Times 
appears to have intentionally manipulated its prompts to the AI system to get 
it to reproduce its training data. He said via email that the examples in the 
lawsuit “are not reflective of intended use or normal user behavior and violate 
our terms of use.”

“Many of their examples are not replicable today,” Rubin added, “and we 
continually make our products more resilient to this type of misuse.”

The Times isn’t the only organization that has found AI systems producing 
outputs that resemble copyrighted works. A lawsuit filed by Getty Images 
against Stability AI notes examples of its Stable Diffusion image generator 
reproducing the Getty watermark. And a recent blog post by AI expert Gary 
Marcus shows examples in which Microsoft’s Image Creator appeared to generate 
pictures of famous characters from movies and TV shows.

Microsoft did not respond to a request for comment.

The Times did not specify the amount it is seeking, although the company 
estimates damages to be in the “billions.” It is also asking for a permanent 
ban on the unlicensed use of its work. More dramatically, it asks that any 
existing AI models trained on Times content be destroyed.

Because the AI cases represent new terrain in copyright law, it is not clear 
how judges and juries will ultimately rule, several legal experts agreed.

While the Google Books case might work in the tech firms’ favor, the fair-use 
picture was muddied by the Supreme Court’s recent decision in a case involving 
artist Andy Warhol’s use of a photograph of the rock star Prince, said Daniel 
Gervais, a professor at Vanderbilt Law and director of its intellectual 
property program. The court found that if the copying is done to compete with 
the original work, “that weighs against fair use” as a defense. So the Times’s 
case may hinge in part on its ability to show that products like ChatGPT and 
Bing compete with and harm its business.

“Anyone who’s predicting the outcome is taking a big risk here,” Gervais said. 
He said for business plaintiffs like the New York Times, one likely outcome 
might be a settlement that grants the tech firms a license to the content in 
exchange for payment. The Times spent months in talks with OpenAI and 
Microsoft, which holds a major stake in OpenAI, before the newspaper sued, the 
Times disclosed in its lawsuit.

Some media companies have already struck arrangements over the use of their 
content. Last month, OpenAI agreed to pay German media conglomerate Axel 
Springer, which publishes Business Insider and Politico, to show parts of 
articles in ChatGPT responses. The tech company has also struck a deal with the 
Associated Press for access to the news service’s archives.

A Times victory could have major consequences for the news industry, which has 
been in crisis since the internet began to supplant newspapers and magazines 
nearly 20 years ago. Since then, newspaper advertising revenue has been in 
steady decline, the number of working journalists has dropped dramatically and 
hundreds of communities across the country no longer have local newspapers.

But even as publishers seek payment for the use of their human-generated 
materials to train AI, some also are publishing works produced by AI — which 
has prompted both backlash and embarrassment when those machine-created 
articles are riddled with errors.

Cornell’s Grimmelmann said AI copyright cases might ultimately hinge on the 
stories each side tells about how to weigh the technology’s harms and benefits.

“Look at all the lawsuits, and they’re trying to tell stories about how these 
are just plagiarism machines ripping off artists,” he said. “Look at the [AI 
firms’ responses], and they’re trying to tell stories about all the really 
interesting things these AIs can do that are genuinely new and exciting.”

Reid of Colorado Law noted that tech giants may make less sympathetic 
defendants today for many judges and juries than they did a decade ago when the 
Google Books case was being decided.

“There’s a reason you’re hearing a lot about innovation and open-source and 
start-ups” from the tech industry, he said. “There’s a race to frame who’s the 
David and who’s the Goliath here.”
