On 5/6/26 03:44, Nguyễn Gia Phong via Development of GNU Guix and the
GNU System distribution. wrote:
Hi Greg,
On 2026-05-05 at 15:04-04:00, Greg Hogan wrote:
I would be interested to see an analysis of incidental memorization
rather than the claims of extractive memorization typically presented.
And taking into consideration the age of the models.
Same here.
The "open-slopware"[1] list that Ian shared earlier has a reference to a
recent (March) article[2] where they get LLMs to output verbatim text
that goes a bit further than earlier research. It was the most
interesting part of that list for me.
First they fine-tune an LLM to produce verbatim texts from book
abstracts. This is not surprising, because the text they ask it to
produce is the text that was used to generate the prompt. But then they
also claim that, once fine-tuned, they can trigger the LLM to output
verbatim text of books that were not used in the fine-tuning. (Because
those books are somehow 'clustered' in the weights.)
Their main figure plots the "Longest Contiguous Regurgitated Span",
which is at 20 words before their fine-tuning and at around 400 words
after it. So still far from incidental extraction, but interesting
nonetheless.
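
To make that metric concrete, here is a rough sketch of how a "longest
contiguous regurgitated span" could be measured between a model output
and a reference text. This is only an illustration, not the article's
code: the function name, the whitespace tokenization and the exact-match
comparison are all assumptions.

```python
def longest_contiguous_span(output: str, reference: str) -> int:
    """Return the length, in words, of the longest run of consecutive
    words that appears verbatim in both texts.

    Assumption: words are split on whitespace and compared exactly;
    the article may tokenize or normalize differently.
    """
    out_words = output.split()
    ref_words = reference.split()
    best = 0
    # prev[j] is the length of the shared run that ends at the previous
    # output word and at ref_words[j - 1].
    prev = [0] * (len(ref_words) + 1)
    for out_word in out_words:
        curr = [0] * (len(ref_words) + 1)
        for j, ref_word in enumerate(ref_words, start=1):
            if out_word == ref_word:
                # Extend the run that ended one word earlier in both texts.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    generated = "the quick brown fox sat"
    original = "a quick brown fox ran"
    print(longest_contiguous_span(generated, original))
```

With such a measure, the reported change would correspond to the result
going from about 20 to about 400 when comparing the model's output with
the book text.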
The way it is written makes me believe it is still not 'loophole free',
because I understand that the summary needed to extract the verbatim
text is derived from the original, even for the books not used in the
fine-tuning (if I understand correctly). So you'd still need to have
the original text to begin with. (Also to get those 20 words, I think.)
The article is titled "Whack-a-Mole", but I find that silly w.r.t.
our context. For this discussion (incorporating LLM code in packaged
software or in Guix itself) it should be enough to make it hard
(impossible?) to extract copyrighted works by accident. I'd say it is
not problematic (for those purposes) if it is possible to extract
copyrighted works deliberately.
Hugo
[1] https://codeberg.org/small-hack/open-slopware
[2] https://arxiv.org/html/2603.20957v2