On 2/28/24 6:06 AM, Matt Jolly wrote:
> 
>> But where do we draw the line? Are translation tools like DeepL
>> allowed? I don't see much of a copyright issue for these.
> 
> I'd also like to jump in and play devil's advocate. There's a fair
> chance that this is because I just got back from a
> supercomputing/research conf where LLMs were the hot topic in every
> keynote.
> 
> As mentioned by Sam, this RFC is performative. Any users that are going
> to abuse LLMs are going to do it _anyway_, regardless of the rules. We
> already rely on common sense to filter these out; we're always going to
> have BS/Spam PRs and bugs - I don't really think that the content being
> generated by LLM is really any worse.
> 
> This doesn't mean that I think we should blanket allow poor quality LLM
> contributions. It's especially important that we take into account the
> potential for bias, factual errors, and outright plagiarism when these
> tools are used incorrectly.  We already have methods for weeding out low
> quality contributions and bad faith contributors - let's trust in these
> and see what we can do to strengthen these tools and processes.


Why is this an argument *against* performative statement of intent?

There are too many ways for bad faith contributors to maliciously engage
with the community, and no one is proposing a need to lay down rules
that forbid such people.

It is meaningful on its own to specify good-faith rules that people
should abide by in order to produce a smoother experience. And telling
people that they are not supposed to do XXX is a good way to reduce the
amount of low-quality contributions that devs need to sift through...


> A bit closer to home for me, what about using an LLM as an assistive
> technology / to reduce boilerplate? I'm recovering from RSI - I don't
> know when (if...) I'll be able to type like I used to again. If a model
> is able to infer some mostly salvagable boilerplate from its context
> window I'm going to use it and spend the effort I would writing that to
> fix something else; an outright ban on LLM use will reduce my _ability_
> to contribute to the project.


So by this appeal to emotion, you can claim anything is assistive
technology and therefore should be allowed because it's discriminatory
against the disabled if you don't allow it?

Is there some special attribute of disabled persons that means they are
exempted from copyright law?

What counts as assistive technology? Is it any technology that disabled
persons use, or technology designed to bridge the gap for the disabled?
If a disabled person uses vim because shortcuts, does that mean vim is
"assistive technology" because someone used it to "assist" them?

...

I somehow feel like I maybe heard about assistive technology existing
that assisted disabled persons in the process of dictating their
thoughts while avoiding physically stressful typing activities.

It didn't involve having the "assistive technology" provide both the
content and the typing, as that's not really *assisting*.


> In line with the above, if the concern is about code quality / potential
> for plagiarised code, what about indirect use of LLMs? Imagine a
> hypothetical situation where a contributor asks a LLM to summarise a
> topic and uses that knowledge to implement a feature. Is this now
> tainted / forbidden knowledge according to the Gentoo project?


Since your imagined hypothetical involves a person learning from a work
and then producing their own, the result cannot be said to be a
derivative copyrighted work of the LLM's training data -- for the same
reason that reading a handwritten, copyrighted journal article about "a
topic" to learn about that topic, and then writing software based on
the ideas from the article, does not produce a *derivative copyrighted
work*. So the answer is extremely trivially no?

The copyright issue with LLMs isn't that they ingest blog posts about
how cool ebuilds are and use that knowledge to write ebuilds. The
copyright issue with LLMs is that they ingest GitHub repos full of
non-Gentoo ebuilds copyrighted under who knows what license and then
regurgitate those ebuilds. That output *is* a derivative work.

Prose summaries of generic topics are a good way to break the chain
when it comes to derived works; that has nothing to do with LLMs
specifically.


Nonetheless, any credible form of scholarship is going to demand that
participants be well versed in where the line is between saying
something in your own words with citation, and plagiarism.



> As a final not-so-hypothetical, what about a LLM trained on Gentoo docs
> and repos, or more likely trained on exclusively open-source
> contributions and fine-tuned on Gentoo specifics? I'm in the process of
> spinning up several models at work to get a handle on the tech / turn
> more electricity into heat - this is a real possibility (if I can ever
> find the time).


If you can state for a fact that you have done so, then clearly it's not
a copyright violation.

"exclusively open-source contributions" is NOT however a good bar. There
are lots of open-source licenses, but not all of them are compatible
with the GPL2 at all, and the ones that are compatible -- in fact,
licenses in general -- tend to require you to include copyright notices.

The LLM would have to know how to do that. Or if it is trained
exclusively on gentoo repositories it may be able to say "okay all
inputs are copyright GPL2 The Gentoo Authors".


> The cat is out of the bag when it comes to LLMs. In my real-world job I
> talk to scientists and engineers using these things (for their
> strengths) to quickly iterate on designs, to summarise experimental
> results, and even to generate testable hypotheses. We're only going to
> see increasing use of this technology going forward.


Huh? "The cat is out of the bag". What does this even mean? I'm not sure
how to read this other than:

Everyone else is breaking the law anyways so who cares. You can't stop
them, so might as well join them.

If it's something good or acceptable to do, then it is good or
acceptable without needing to be defended by "but lots of people are
doing it so you can't stop us".

That being said, here's some food for thought: if something bad happens,
and we *agree* it's bad, but every time the topic comes up people say
"well, it's bad but everyone else is doing it so what can we do, might
as well give in"...

... how do you think it became so popular to begin with? Maybe someone
before you said "the cat is out of the bag"?



-- 
Eli Schwartz
