Re: [nexa] R: R: ‘Biggest act of copyright theft in history’: thousands of Australian books allegedly used to train AI model | Australia news | The Guardian

Stefano Quintarelli Fri, 29 Sep 2023 08:08:09 -0700

Thanks,

but the focus of my question is not the training itself but, before any training, the origin of the texts used for it.

Bye, s.

On 29/09/23 16:36, Lorenzo Albertini wrote:

§§ 54-64 of the complaint (easily found, e.g. here <https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=&ved=2ahUKEwiDh_vygtCBAxX4SPEDHZlhAxMQFnoECBYQAQ&url=https%3A%2F%2Fwww.classaction.org%2Fmedia%2Fauthors-guild-et-al-v-openai-inc-et-al.pdf&usg=AOvVaw1tUMb6Gk10kZCsvoAo0PH6&opi=89978449>):


<<54. Recent generative AI systems designed to recognize input text and generate output text are built on “large language models” or “LLMs.”

55. LLMs use predictive algorithms that are designed to detect statistical patterns in the text datasets on which they are “trained” and, on the basis of these patterns, generate responses to user prompts. “Training” an LLM refers to the process by which the parameters that define an LLM’s behavior are adjusted through the LLM’s ingestion and analysis of large “training” datasets.

56. Once “trained,” the LLM analyzes the relationships among words in an input prompt and generates a response that is an approximation of similar relationships among words in the LLM’s “training” data. In this way, LLMs can be capable of generating sentences, paragraphs, and even complete texts, from cover letters to novels.

57. “Training” an LLM requires supplying the LLM with large amounts of text for the LLM to ingest—the more text, the better. That is, in part, the large in large language model.

58. As the U.S. Patent and Trademark Office has observed, LLM “training” “almost by definition involve[s] the reproduction of entire works or substantial portions thereof.”[4]

59. “Training” in this context is therefore a technical-sounding euphemism for “copying and ingesting.”

60. The quality of the LLM (that is, its capacity to generate human-seeming responses to prompts) is dependent on the quality of the datasets used to “train” the LLM.

61. Professionally authored, edited, and published books—such as those authored by Plaintiffs here—are an especially important source of LLM “training” data.

62. As one group of AI researchers (not affiliated with Defendants) has observed, “[b]ooks are a rich source of both fine-grained information, how a character, an object or a scene looks like, as well as high-level semantics, what someone is thinking, feeling and how these states evolve through a story.”[5]

63. In other words, books are the high-quality materials Defendants want, need, and have therefore outright pilfered to develop generative AI products that produce high-quality results: text that appears to have been written by a human writer.

64. This use is highly commercial.>>.



-----Original message-----

From: nexa <[email protected]> On behalf of Rossana Morriello

Sent: Friday, 29 September 2023 16:08

To: Nexa <[email protected]>

Subject: [nexa] R: ‘Biggest act of copyright theft in history’: thousands of Australian books allegedly used to train AI model | Australia news | The Guardian

I am not a legal expert, but I think this roundup may be useful to the discussion:

https://www.thefashionlaw.com/from-chatgpt-to-deepfake-creating-apps-a-running-list-of-key-ai-lawsuits/

Regards,

Rossana Morriello

-----Original message-----

From: nexa <[email protected]> On behalf of Stefano Quintarelli

Sent: Friday, 29 September 2023 15:21

Cc: Nexa <[email protected]>

Subject: Re: [nexa] ‘Biggest act of copyright theft in history’: thousands of Australian books allegedly used to train AI model | Australia news | The Guardian

I have a question for the legal experts (more than one, actually).

To train a model, I need a file containing the digital version of a text. (I am of course considering texts that are not PD, CC0, etc.)

I can obtain the digital version of a text from an ebook (already digital) by removing the DRM it probably carries.

But an ebook is not a good; it is a service subject to a licence. So if the licence does not grant the right to extract the digital text in order to train a model on it, it seems to me that the licence has already been violated. I therefore believe the text cannot be used as the basis for training, all the more so if the purpose of that training is commercial (if I sell a service based on that model).

If that is so, then to train my model I have to obtain the digital text by scanning/OCRing a paper copy.

But that is allowed, if I am not mistaken, only for personal, non-commercial use.

If this is correct, I do not see any way to obtain a digital text without infringing a licence or copyright.

Where is the flaw in this reasoning?

Thanks, s.

On 29/09/23 15:00, Stefano Borroni Barale wrote:

> Good morning, list,
>
>> The idea that training a model on copyrighted texts is a violation of that copyright is highly debatable
>
> So far, I have the impression that all the lawyers on the list will agree.
>
>> The reasoning is actually quite simple: if learning from a text violated its copyright, we would all be criminals.
>
> But since we are human and what we produce is not (politicians' speeches aside (*)) ontologically identical to the output of non-living technical beings, logic dictates that what applies to us cannot apply to an LLM, just as copyright law does not apply slavishly to the use of human texts to create language models.
>
> This is why all attempts to "protect by copyright" the output of generative software have failed miserably, with the reasons set out in court rulings, which, as far as the law is concerned, I believe carry far more weight than the CC website.
>
> My impression is that the question will keep lawyers, computer scientists, philosophers and society busy for a very loooong time yet.
>
> SBB
>
> (*) As children of the '80s who played with this amusing toy well know:
> https://www.enricodalbosco.it/giochi/tubolario/

>> There is physically no trace of those texts inside the models; nothing is copied. The models are a transformative work of those texts, not a derivative one.
>>
>> Creative Commons argues this very well:
>> https://creativecommons.org/2023/02/17/fair-use-training-generative-ai/
>>
>> That said, I quote the words of another author, Jeff Jarvis:
>> https://www.facebook.com/jeff.jarvis/posts/pfbid0LMFeqdTYoxnGHQAZwp5HMmeeVqgMSjL2dkcwMcBojkb2cinBpgYTHyc7Fhq1B9NPl
>>
>> «I, for one, am not complaining about my books being in large language model training sets. I write to enter ideas into public discourse. I prefer informed over ignorant AI. I believe it is fair use for anyone to read & use books for transformative work. In fact, I'd probably feel snubbed if my books were not there. I'm happy when they are in libraries. I'm fine that they're here.»
>>
>> Fabio
>>
>> On Fri, 29 Sep 2023 at 07:52, Alberto Cammozzo via nexa <[email protected]> wrote:

>>

>>> https://www.theguardian.com/australia-news/2023/sep/28/australian-books-training-ai-books3-stolen-pirated
>>>
>>> Thousands of books from some of Australia’s most celebrated authors have potentially been caught up in what Booker prize-winning novelist Richard Flanagan has called “the biggest act of copyright theft in history”.
>>>
>>> The works have allegedly been pirated by the US-based Books3 dataset and used to train generative AI for corporations such as Meta and Bloomberg.
>>>
>>> Flanagan, who found 10 of his works, including the multi-international award-winning 2013 novel The Narrow Road to the Deep North, on the Books3 dataset, told Guardian Australia he was deeply shocked by the discovery made several days ago.
>>>
>>> “I felt as if my soul had been strip mined and I was powerless to stop it,” he said in a statement.
>>>
>>> “This is the biggest act of copyright theft in history.”

>>>


>>> The Australian Publishers Association confirmed to Guardian Australia on Wednesday that as many as 18,000 fiction and nonfiction titles with Australian ISBNs (unique international standard book numbers) appeared to be affected by the copyright infringement, although it is not yet clear what proportion of these are Australian editions of internationally authored books.
>>>
>>> “We’re still working through [the data] to work out the impact in terms of Australian authors,” APA spokesperson Stuart Glover said.
>>>
>>> “This is a massive legal and ethical challenge for the publishing industry and for authors globally.”
>>>
>>> A search tool published on Monday by US media platform The Atlantic and uploaded by the US Authors Guild on Wednesday revealed the works of Peter Carey, Helen Garner, Kate Grenville, Anna Funder, Christos Tsiolkas and Thomas Keneally, as well as Flanagan and dozens of other high-profile Australian authors, were included in the pirated dataset containing more than 180,000 titles.
>>>
>>> On Thursday, the Australian Society of Authors issued a statement saying it was “horrified” to learn that the works of Australian writers were being used to train artificial intelligence without permission from the authors.
>>>
>>> ASA chief executive, Olivia Lanchester, described the Books3 dataset as piracy on an industrial scale.
>>>
>>> “Authors appropriately feel outraged,” Lanchester said. “The fact is this technology relies upon books, journals, essays written by authors, yet permission was not sought nor compensation granted.”
>>>
>>> Lanchester said the Australian literary industry, while not objecting per se to emerging technologies such as AI, was deeply concerned about the lack of transparency evident in the development and monetisation of AI by global tech companies.
>>>
>>> “Turning a blind eye to the legitimate rights of copyright owners threatens to diminish already precarious creative careers,” she said.
>>>
>>> “The enrichment of a few powerful companies is at the cost of thousands of individual creators. This is not how a fair market functions.”
>>>
>>> Josephine Johnston, chief executive of Australia’s Copyright Agency, described the Books3 development as “a free kick to big tech” at the expense of Australia’s creative and cultural life.
>>>
>>> “We’re going to need greater transparency – how these tools have been developed, trained, how they operate – before people can truly understand what their legal rights might be,” she said.
>>>
>>> “We seem to be in this terrible position now where content owners – remembering that the vast majority of them will be individual authors – may actually have to take out court cases to enforce their rights.”
>>>
>>> Australian copyright law protects creators of original content from data scraping.
>>>
>>> Litigation in the US against ChatGPT creator OpenAI over use of allegedly pirated book datasets, Books1 and Books2 (which do not appear to be affiliated with Books3) has already commenced.
>>>
>>> In July, North American horror/fantasy writers Mona Awad (author of Bunny) and Paul Tremblay (author of The Cabin at the End of the World) filed a lawsuit in a San Francisco federal court, alleging ChatGPT unlawfully digested their books as part of its AI training data.
>>>
>>> On 28 August, OpenAI filed a motion to dismiss the lawsuit, arguing that the authors “misconceive the scope of copyright, failing to take into account the limitations and exceptions (including fair use) that properly leave room for innovations like the large language models now at the forefront of artificial intelligence”.
>>>
>>> On 19 September the Writers Guild and 17 of its members, including bestselling novelists John Grisham, George RR Martin and Jodi Picoult, filed a complaint in a New York district court against OpenAI, seeking redress for “flagrant and harmful infringements” of guild members’ registered copyrights.
>>>
>>> In a statement on its website, the guild says while it is aware that companies such as Meta and Bloomberg have used the Books3 dataset to train their LLMs, it is not yet clear whether OpenAI is using Books3 to train its ChatGPT models GPT 3.5 or GPT 4.

>>>


>>> Guardian Australia has sought comment from OpenAI, which has yet to officially respond to the guild’s complaint, and Meta.
>>>
>>> On 4 September, US technology magazine Wired reported that a Danish anti-piracy group called Rights Alliance had been told by Bloomberg that the company did not plan to train future versions of its BloombergGPT using Books3.
>>>
>>> Bloomberg declined to respond to the Guardian’s queries.
>>>
>>> The APA said the global nature of the issue would present significant challenges in enforcement and prosecution, and has joined the authors’ society in calling for AI technologies to be regulated.
>>>
>>> Consultation closed last month for a Department of Industry, Science and Resources discussion paper on supporting responsible AI.
>>>
>>> A parliamentary inquiry is under way examining the use of generative artificial intelligence in the Australian education system.
>>>
>>> Flanagan said it was up to the Australian government to act to protect Australia’s writers.
>>>
>>> “It has power and we do not,” he said.
>>>
>>> “If it cares for our culture it must now stand up and fight for it.”

>>>

>>> _______________________________________________

>>> nexa mailing list

>>> [email protected]

>>> https://server-nexa.polito.it/cgi-bin/mailman/listinfo/nexa
