Hi,

I want to dig a little deeper into the actual ToS and make a distinction 
between terms that place a burden on the output of the model and terms that 
place a burden on access/usage.

Here are the Claude consumer ToS that seem relevant:
```
You may not access or use, or help another person to access or use, our 
Services in the following ways:
 1. To develop any products or services that compete with our Services, 
including to develop or train any artificial intelligence or machine learning 
algorithms or models or resell the Services.
```

And the commercial ToS:
```
 1. *Use Restrictions.* Customer may not and must not attempt to (a) access the 
Services to build a competing product or service, including to train competing 
AI models or resell the Services except as expressly approved by Anthropic; (b) 
reverse engineer or duplicate the Services; or (c) support any third party’s 
attempt at any of the conduct restricted in this sentence.
```
One way to interpret this is that the burden is on access/usage: if what you 
are doing when you access/use the service is acceptable, then the output is 
unencumbered. So, for example, if you are developing code for Apache Cassandra 
and you generate something for that purpose, then your access was not any of 
(a) or (b), and it would be a very large stretch to say that contributing that 
code to the ASF contributes to (c).

So unless I hear Legal say otherwise, I would say those ToS are acceptable.

Now let's look at OpenAI's terms which state:
```
 • Use Output to develop models that compete with OpenAI.
```
This is more concerning because it is a restriction on the output, not on 
access.

Gemini has restrictions on "generating or distributing content that 
facilitates: ... Spam, phishing, or malware", and that is a little concerning 
because it sounds like it encumbers the output of the model, not the access.

It really really sucks to be in the position of trying to be a lawyer for every 
single service's ToS.

Ariel

On Thu, Aug 14, 2025, at 12:36 PM, Ariel Weisberg wrote:
> Hi,
> 
> It's not up to us to interpret, right? It's been interpreted by Apache Legal, 
> and if we are confused we can check, but this is one instance where they 
> aren't being ambiguous or delegating to us to make a decision.
> 
> I can't see how we can follow Legal's guidance and accept output from models 
> or services running models with these issues.
> 
> This isn't even a change of what we settled on, right? We seemed to broadly 
> agree that we wouldn't accept output from models that aren't license 
> compatible. What has changed is that we have realized it applies to more 
> models.
> 
> At this point I don't think we should try to maintain a list. We should 
> provide brief guidance that we don't accept code from models/services that 
> are not license compatible (and highlight that this covers most popular 
> services) and encourage people to watch out for models/services that might 
> reproduce license incompatible training data.
> 
> Ariel
> 
> On Fri, Aug 1, 2025, at 1:13 PM, Josh McKenzie wrote:
>> So I'll go ahead and preface this email - I'm not trying to open Pandora's 
>> Box or re-litigate settled things from the thread. *But...*
>> 
>>>         • The terms and conditions of the generative AI tool do not place 
>>> any restrictions on use of the output that would be inconsistent with the 
>>> Open Source Definition. https://opensource.org/osd/
>> By that logic, Anthropic's terms would also run afoul of that, right?
>> https://www.anthropic.com/legal/consumer-terms
>>> You may not access or use, or help another person to access or use, our 
>>> Services in the following ways:
>>> ...
>>> 2. To develop any products or services that compete with our Services, 
>>> including to develop or train any artificial intelligence or machine 
>>> learning algorithms or models or resell the Services.
>>> ...
>> 
>> Strictly speaking, that collides with the open source definition: 
>> https://opensource.org/osd
>>> 6. No Discrimination Against Fields of Endeavor
>>> 
>>> The license must not restrict anyone from making use of the program in a 
>>> specific field of endeavor. For example, it may not restrict the program 
>>> from being used in a business, or from being used for genetic research.
>> 
>> Which is going to hold true for basically all AI platforms. At least right 
>> now, they all have some form of restriction and verbiage discouraging using 
>> their services to build competing services.
>> 
>> Gemini, similar terms <https://ai.google.dev/gemini-api/terms>:
>>> You may not use the Services to develop models that compete with the 
>>> Services (e.g., Gemini API or Google AI Studio). You also may not attempt 
>>> to reverse engineer, extract or replicate any component of the Services, 
>>> including the underlying data or models (e.g., parameter weights).
>> Plus a prohibited use clause.
>> 
>> So ISTM we should either be ok with all of them (i.e. Cassandra doesn't 
>> compete with any of them and it matches the definition of open-source in the 
>> context of our project's usage) or ok with none of them. And I'm heavily in 
>> favor of the former interpretation.
>> 
>> On Fri, Aug 1, 2025, at 11:48 AM, David Capwell wrote:
>>> 
>>> 
>>> > On Aug 1, 2025, at 6:38 AM, Josh McKenzie <jmcken...@apache.org> wrote:
>>> > 
>>> >> Kimi K2 has similar wording as OpenAI so I assume they are banned as 
>>> >> well? 
>>> > 
>>> > What about the terms is incompatible with the ASF? Looks like you're good 
>>> > to go with whatever you generate?
>>> 
>>> 
>>> Go to the "Service Misuse" section
>>> 
>>>     • By using Kimi to develop, train, or improve algorithms, models, etc., 
>>> that are in direct or indirect competition with us;
>>> 
>>> 
>>> This is the type of wording that caused OpenAI to be excluded. 
>>> 
>>> In https://www.apache.org/legal/generative-tooling.html
>>> 
>>>         • The terms and conditions of the generative AI tool do not place 
>>> any restrictions on use of the output that would be inconsistent with the 
>>> Open Source Definition. https://opensource.org/osd/
>>> 
>>> It's very possible I misunderstood, but the legal thread called out OpenAI 
>>> for similar wording, which caused them to add this specific section to 
>>> cover it.
>>> 
>>> 
>>> > 
>>> >> The ownership of the content generated based on Kimi is maintained by 
>>> >> you, and you are responsible for its independent judgment and use. Any 
>>> >> intellectual property issues arising from the use of content generated 
>>> >> by Kimi are handled by you, and we are not responsible for any losses 
>>> >> caused thereby. If you cause any loss to us, we have the right to 
>>> >> recover from you.
>>> > Is there any concern about the transference of "cause any loss to us, we 
>>> > have the right to recover from you."?
>>> > 
>>> > Regarding OpenAI's terms, what aspects of them are problematic for ASF 
>>> > donation? Reading through that now:
>>> >> Use Output to develop models that compete with OpenAI.
>>> > 
>>> > I don't think that someone in the future using Cassandra with code 
>>> > generated by OpenAI would qualify, i.e. this can't be transitive, else 
>>> > it'd poison SBOMs everywhere with dependency chains.
>>> > 
>>> >> Ownership of content. As between you and OpenAI, and to the extent 
>>> >> permitted by applicable law, you (a) retain your ownership rights in 
>>> >> Input and (b) own the Output. We hereby assign to you all our right, 
>>> >> title, and interest, if any, in and to Output. 
>>> > Seems compatible?
>>> 
>>> Ownership isn't the concern; most patches would not be inventing new data 
>>> structures and would likely be gluing Cassandra code together, so they 
>>> wouldn't be copyrightable (there may be cases that are, but we should treat 
>>> those in isolation, as the same concern happens with human code).
>>> 
>>> OpenAI called out that you can’t use their output to build something to 
>>> compete with them, which is what was flagged as being against the “Open 
>>> Source Definition”.
>>> 
>>> > 
>>> > On Fri, Aug 1, 2025, at 6:22 AM, David Capwell wrote:
>>> >> 
>>> >> 
>>> >> Not really, this thread has made me really see that we need to know the 
>>> >> tool/model provider so we can confirm the ToS allows contributions.
>>> >> 
>>> >> OpenAI is not allowed and we know most popular ones are, but what about 
>>> >> new ones?  Kimi K2 has similar wording as OpenAI so I assume they are 
>>> >> banned as well? 
>>> >> 
>>> >> https://kimi.moonshot.cn/user/agreement/modelUse
>>> >> 
>>> >> I don't know of any tool that is currently excluded; it's only been 
>>> >> model providers so far… but this is moving fast, so we should also know 
>>> >> the tool.
>>> >> 
>>> >> The best compromise is having an auto-approve allow list; if the model 
>>> >> is on that list, you don't need to disclose?
>>> >> 
>>> >> Sent from my iPhone
>>> >> 
>>> >>> On Jul 31, 2025, at 9:13 PM, Yifan Cai <yc25c...@gmail.com> wrote:
>>> >>> 
>>> >>> Does "optionally disclose the LLM used in whatever way you prefer and 
>>> >>> definitely no OpenAI" meet everyone's expectations?
>>> >>> 
>>> >>> - Yifan
>>> >>> 
>>> >>> On Thu, Jul 31, 2025 at 1:56 PM Josh McKenzie <jmcken...@apache.org> 
>>> >>> wrote:
>>> >>> 
>>> >>> Do we have a consensus on this topic or is there still further 
>>> >>> discussion to be had?
>>> >>> 
>>> >>> On Thu, Jul 24, 2025, at 8:26 AM, David Capwell wrote:
>>> >>>> 
>>> >>>> 
>>> >>>> Given the above, code generated in whole or in part using AI can be 
>>> >>>> contributed if the contributor ensures that:
>>> >>>> 1. The terms and conditions of the generative AI tool do not place any 
>>> >>>> restrictions on use of the output that would be inconsistent with the 
>>> >>>> Open Source Definition.
>>> >>>> 2. At least one of the following conditions is met:
>>> >>>>     2.1. The output is not copyrightable subject matter (and would not 
>>> >>>> be even if produced by a human).
>>> >>>>     2.2. No third party materials are included in the output.
>>> >>>>     2.3. Any third party materials that are included in the output are 
>>> >>>> being used with permission (e.g., under a compatible open-source 
>>> >>>> license) of the third party copyright holders and in compliance with 
>>> >>>> the applicable license terms.
>>> >>>> A contributor obtains reasonable certainty that conditions 2.2 or 2.3 
>>> >>>> are met if the AI tool itself provides sufficient information about 
>>> >>>> output that may be similar to training data, or from code scanning 
>>> >>>> results.
>>> >>>> 
>>> >>>> ASF Generative Tooling Guidance: 
>>> >>>> https://www.apache.org/legal/generative-tooling.html
>>> >>>> 
>>> >>>> 
>>> >>>> Ariel shared this at the start.  Right now we must know what tool was 
>>> >>>> used so we can make sure its license is ok.  The only tool currently 
>>> >>>> flagged as not acceptable is OpenAI, as it has wording limiting what 
>>> >>>> you may do with its output.
>>> >>>> 
>>> >>>> Sent from my iPhone
>>> >>>> 
>>> >>>>> On Jul 23, 2025, at 1:31 PM, Jon Haddad <j...@rustyrazorblade.com> 
>>> >>>>> wrote:
>>> >>>>> 
>>> >>>>> +1 to Patrick's proposal.
>>> >>>>> 
>>> >>>>> On Wed, Jul 23, 2025 at 12:37 PM Patrick McFadin <pmcfa...@gmail.com> 
>>> >>>>> wrote:
>>> >>>>> I just did some review on all the case law around copyright and AI 
>>> >>>>> code. So far, every claim has been dismissed. There are some other 
>>> >>>>> cases, like the NYTimes one, which have more merit and are proceeding.
>>> >>>>> 
>>> >>>>> Which leads me to the opinion that this is feeling like a premature 
>>> >>>>> optimization. Somebody creating a PR should not have to also submit an 
>>> >>>>> SBOM, which is essentially what we're asking. It's an undue burden and 
>>> >>>>> friction on the process when we should be looking for ways to reduce 
>>> >>>>> friction.
>>> >>>>> 
>>> >>>>> My proposal is no disclosures required. 
>>> >>>>> 
>>> >>>>> On Wed, Jul 23, 2025 at 12:06 PM Yifan Cai <yc25c...@gmail.com> wrote:
>>> >>>>> According to the thread, the disclosure is for legal purposes. For 
>>> >>>>> example, that the patch was not produced by OpenAI's service. I think 
>>> >>>>> having the discussion to clarify AI usage in the project is 
>>> >>>>> meaningful. I guess many are hesitating because of the lack of clarity 
>>> >>>>> in the area.
>>> >>>>> 
>>> >>>>> > I don’t believe or agree with us assuming we should do this for 
>>> >>>>> > every PR
>>> >>>>> 
>>> >>>>> I am with you, David. Updating the mailing list for PRs is 
>>> >>>>> overwhelming for both the author and the community.
>>> >>>>> 
>>> >>>>> I also do not feel the co-author field is the best place.
>>> >>>>> 
>>> >>>>> - Yifan
>>> >>>>> 
>>> >>>>> On Wed, Jul 23, 2025 at 11:51 AM Patrick McFadin <pmcfa...@gmail.com> 
>>> >>>>> wrote:
>>> >>>>> This is starting to get ridiculous. Disclosure statements on exactly 
>>> >>>>> how a problem was solved? What’s next? Time cards? 
>>> >>>>> 
>>> >>>>> It's time to accept the world as it is. AI is in the coding toolbox 
>>> >>>>> now, just like IDEs, linters, and code formatters. Some may not like 
>>> >>>>> using them, some may love using them. What matters is that a problem 
>>> >>>>> was solved and that the code matches whatever quality standard the 
>>> >>>>> project upholds, which should be enforced by testing and code reviews.
>>> >>>>> 
>>> >>>>> Patrick
>>> >>>>> 
>>> >>>>> On Wed, Jul 23, 2025 at 11:31 AM David Capwell <dcapw...@apple.com> 
>>> >>>>> wrote:
>>> >>>>>> 
>>> >>>>>> 
>>> >>>>>> David is disclosing it in the mailing list and the GH page. Should 
>>> >>>>>> the disclosure be persisted in the commit?
>>> >>>>> 
>>> >>>>> 
>>> >>>>> Someone asked me to update the ML, but I don't believe or agree with 
>>> >>>>> us assuming we should do this for every PR; personally, storing this 
>>> >>>>> in the PR description is fine to me, as you are telling the reviewers 
>>> >>>>> (who are the ones you need to communicate this to).
>>> >>>>> 
>>> >>>>> 
>>> >>>>>> I’d say we can use the co-authored part of our commit messages to 
>>> >>>>>> disclose the actual AI that was used? 
>>> >>>>> 
>>> >>>>> 
>>> >>>>> Heh... I kinda feel dirty doing that… No one does that when they take 
>>> >>>>> something from a blog or Stack Overflow, but when you do that you 
>>> >>>>> should still attribute by linking… which I guess is what Co-Authored 
>>> >>>>> does?
>>> >>>>> 
>>> >>>>> I don’t know… feels dirty...
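>>> >>>>> 
>>> >>>>> For reference, the kind of trailer we're talking about would just be 
>>> >>>>> one line at the end of the commit message, something like this (the 
>>> >>>>> model name and address here are purely illustrative, not an agreed 
>>> >>>>> convention):
>>> >>>>> ```
>>> >>>>> Co-authored-by: Claude <noreply@anthropic.com>
>>> >>>>> ```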
>>> >>>>> 
>>> >>>>> 
>>> >>>>>> On Jul 23, 2025, at 11:19 AM, Bernardo Botella 
>>> >>>>>> <conta...@bernardobotella.com> wrote:
>>> >>>>>> 
>>> >>>>>> That’s a great point. I’d say we can use the co-authored part of our 
>>> >>>>>> commit messages to disclose the actual AI that was used? 
>>> >>>>>> 
>>> >>>>>> 
>>> >>>>>> 
>>> >>>>>>> On Jul 23, 2025, at 10:57 AM, Yifan Cai <yc25c...@gmail.com> wrote:
>>> >>>>>>> 
>>> >>>>>>> Curious, what are the good ways to disclose the information? 
>>> >>>>>>> 
>>> >>>>>>> > All of which comes back to: if people disclose if they used AI, 
>>> >>>>>>> > what models, and whether they used the code or text the model 
>>> >>>>>>> > wrote verbatim or used it as a scaffolding and then heavily 
>>> >>>>>>> > modified everything I think we'll be in a pretty good spot.
>>> >>>>>>> 
>>> >>>>>>> David is disclosing it in the mailing list and the GH page. Should 
>>> >>>>>>> the disclosure be persisted in the commit?
>>> >>>>>>> 
>>> >>>>>>> - Yifan
>>> >>>>>>> 
>>> >>>>>>> On Wed, Jul 23, 2025 at 8:47 AM David Capwell <dcapw...@apple.com> 
>>> >>>>>>> wrote:
>>> >>>>>>> Sent out this patch that was written 100% by Claude: 
>>> >>>>>>> https://github.com/apache/cassandra/pull/4266
>>> >>>>>>> 
>>> >>>>>>> Claude's license doesn't have issues with the current ASF policy as 
>>> >>>>>>> far as I can tell.  If you look at the patch, it's very clear there 
>>> >>>>>>> isn't any copyrighted material (it's gluing together C* classes).
>>> >>>>>>> 
>>> >>>>>>> I could have written this myself, but I had to focus on code 
>>> >>>>>>> reviews and also needed this patch out, so I asked Claude to write 
>>> >>>>>>> it for me so I could focus on reviews.  I have reviewed it myself 
>>> >>>>>>> and it's basically the same code I would have written (notice how 
>>> >>>>>>> small and focused the patch is; larger stuff doesn't normally pass 
>>> >>>>>>> my peer review).
>>> >>>>>>> 
>>> >>>>>>>> On Jun 25, 2025, at 2:37 PM, David Capwell <dcapw...@apple.com> 
>>> >>>>>>>> wrote:
>>> >>>>>>>> 
>>> >>>>>>>> +1 to what Josh said
>>> >>>>>>>> Sent from my iPhone
>>> >>>>>>>> 
>>> >>>>>>>>> On Jun 25, 2025, at 1:18 PM, Josh McKenzie <jmcken...@apache.org> 
>>> >>>>>>>>> wrote:
>>> >>>>>>>>> 
>>> >>>>>>>>> Did some more digging. Apparently the way a lot of 
>>> >>>>>>>>> headline-grabbers have been making models reproduce code verbatim 
>>> >>>>>>>>> is to prompt them with dozens of verbatim tokens of copyrighted 
>>> >>>>>>>>> code as input, where the completion is then very heavily weighted 
>>> >>>>>>>>> toward regurgitating the initial implementation. Which makes 
>>> >>>>>>>>> sense; if you copy/paste 100 lines of copyrighted code, the 
>>> >>>>>>>>> statistically likely completion for that will be that initial 
>>> >>>>>>>>> implementation.
>>> >>>>>>>>> 
>>> >>>>>>>>> For local LLMs, verbatim reproduction is unlikely for a different 
>>> >>>>>>>>> reason but to an apparently comparable degree: they have far 
>>> >>>>>>>>> fewer parameters (32B vs. 671B for DeepSeek, for instance) 
>>> >>>>>>>>> relative to a pre-training corpus of trillions of tokens (30T in 
>>> >>>>>>>>> the case of Qwen3-32B, for instance), so the individual tokens 
>>> >>>>>>>>> from the copyrighted material are highly unlikely to actually be 
>>> >>>>>>>>> stored in the model to be reproduced, and certainly not in 
>>> >>>>>>>>> sequence. They don't have the post-generation checks claimed by 
>>> >>>>>>>>> the SOTA models, but are apparently considered to be in the "< 1 
>>> >>>>>>>>> in 10,000 completions will generate copyrighted code" territory.
>>> >>>>>>>>> 
>>> >>>>>>>>> When given a human-language prompt, or a multi-agent pipelined 
>>> >>>>>>>>> "still human language but from your architect agent" prompt, the 
>>> >>>>>>>>> likelihood of producing a string of copyrighted code in that 
>>> >>>>>>>>> manner is statistically very, very low. I think we're at far more 
>>> >>>>>>>>> risk of contributors copy/pasting Stack Overflow answers or code 
>>> >>>>>>>>> from other projects than we are from modern genAI models 
>>> >>>>>>>>> producing blocks of copyrighted code.
>>> >>>>>>>>> 
>>> >>>>>>>>> All of which comes back to: if people disclose if they used AI, 
>>> >>>>>>>>> what models, and whether they used the code or text the model 
>>> >>>>>>>>> wrote verbatim or used it as a scaffolding and then heavily 
>>> >>>>>>>>> modified everything I think we'll be in a pretty good spot.
>>> >>>>>>>>> 
>>> >>>>>>>>> On Wed, Jun 25, 2025, at 12:47 PM, David Capwell wrote:
>>> >>>>>>>>>> 
>>> >>>>>>>>>> 
>>> >>>>>>>>>> 
>>> >>>>>>>>>>> 2. Models that do not do output filtering to restrict the 
>>> >>>>>>>>>>> reproduction of training data unless the tool can ensure the 
>>> >>>>>>>>>>> output is license compatible?
>>> >>>>>>>>>>> 
>>> >>>>>>>>>>> 2 would basically prohibit locally run models.
>>> >>>>>>>>>> 
>>> >>>>>>>>>> 
>>> >>>>>>>>>> I am not for this for the reasons listed above. There isn't a 
>>> >>>>>>>>>> difference between this and a contributor copying code and 
>>> >>>>>>>>>> sending it our way. We still need to validate that the code can 
>>> >>>>>>>>>> be accepted.
>>> >>>>>>>>>> 
>>> >>>>>>>>>> We also have the issue of this being a broad stroke. If the user 
>>> >>>>>>>>>> asked a model to write a test for code the human wrote, do we 
>>> >>>>>>>>>> reject the contribution because they used a local model? This 
>>> >>>>>>>>>> poses very little copyright risk, yet our policy would now 
>>> >>>>>>>>>> reject it.
>>> >>>>>>>>>> 
>>> >>>>>>>>>> Sent from my iPhone
>>> >>>>>>>>>> 
>>> >>>>>>>>>>> On Jun 25, 2025, at 9:10 AM, Ariel Weisberg <ar...@weisberg.ws> 
>>> >>>>>>>>>>> wrote:
>>> >>>>>>>>>>> 
>>> >>>>>>>>>>> 2. Models that do not do output filtering to restrict the 
>>> >>>>>>>>>>> reproduction of training data unless the tool can ensure the 
>>> >>>>>>>>>>> output is license compatible?
>>> >>>>>>>>>>> 
>>> >>>>>>>>>>> 2 would basically prohibit locally run models.
>>> >>>>>>>>> 
>>> >>>>>>>>> 
>>> >>> 
>>> > 
>>> 
>>> 
>> 
> 
