Re: Accepting AI generated contributions

Łukasz Dywicki Fri, 01 Aug 2025 11:00:18 -0700

Effectively, every online service will restrict competitors from using its resources, since it pays for infrastructure and energy to make a profit, not to make the competition stronger. Given the rumors or evidence (I'm not clear on that) that DeepSeek was trained on some sort of OpenAI resources, I'm not surprised that suppliers have become super careful in this regard.

So far, the safest way to avoid trouble with the license terms of the results seems to be running an on-premises LLM. In that case all the infrastructure costs fall on the donor who runs it to produce patches, and the model itself remains a tool similar to an IDE or a generator. The publisher of the model will need to provide a permissive license for the results, thus giving the contributor freedom of use.

Given all of the above, I doubt that we (at the ASF) are able to effectively control and restrict this. If somebody decides to hide how a patch was actually produced, we have no way to detect it. It's hard with code produced by people, and it's equally hard or even harder with LLM-generated code. There are currently no ways to spot LLM results, and we will probably wait quite a long time for tools that do this for us. We rely on trusting contributors to act in good faith, as they benefit from submitting fixes/features and continuing to use the project. Disclosure remains the contributor's duty; however, maintaining a list of compatible tools probably extends beyond this project or even the whole ASF.

The problem will affect everyone in the industry. As with software licenses, we will end up with tools that are compatible with specific requirements (e.g. those of the OSI). An additional factor the open source community will need to track is changes in Terms & Conditions.

So far OSI provides only this guidance: https://opensource.org/ai


Best,
Łukasz


On 8/1/25 21:13, Josh McKenzie wrote:

So I'll go ahead and preface this email - I'm not trying to open Pandora's Box or re-litigate settled things from the thread. /But.../
• The terms and conditions of the generative AI tool do not place any restrictions on use of the output that would be inconsistent with the Open Source Definition. https://opensource.org/osd/
By that logic, Anthropic's terms would also run afoul of that right?
https://www.anthropic.com/legal/consumer-terms
You may not access or use, or help another person to access or use, our Services in the following ways:
...
2. To develop any products or services that compete with our Services, including to develop or train any artificial intelligence or machine learning algorithms or models or resell the Services.
...
Strictly speaking, that collides with the open source definition: https://opensource.org/osd
      6. No Discrimination Against Fields of Endeavor
The license must not restrict anyone from making use of the program in a specific field of endeavor. For example, it may not restrict the program from being used in a business, or from being used for genetic research.
Which is going to hold true for basically all AI platforms. At least right now, they all have some form of restriction and verbiage discouraging using their services to build competing services.
Gemini, similar terms <https://ai.google.dev/gemini-api/terms>:
You may not use the Services to develop models that compete with the Services (e.g., Gemini API or Google AI Studio). You also may not attempt to reverse engineer, extract or replicate any component of the Services, including the underlying data or models (e.g., parameter weights).
Plus a prohibited use clause.
So ISTM we should either be ok with all of them (i.e. Cassandra doesn't compete with any of them and it matches the definition of open source in the context of our project's usage) or ok with none of them. And I'm heavily in favor of the former interpretation.
On Fri, Aug 1, 2025, at 11:48 AM, David Capwell wrote:
> On Aug 1, 2025, at 6:38 AM, Josh McKenzie <[email protected]> wrote:
>
>> Kimi K2 has similar wording as OpenAI so I assume they are banned as well?
>
> What about the terms is incompatible with the ASF? Looks like you're good to go with whatever you generate?
Go to the "Service Misuse" section
• By using Kimi to develop, train, or improve algorithms, models, etc., that are in direct or indirect competition with us;
This is the type of wording that caused OpenAI to be excluded.
In https://www.apache.org/legal/generative-tooling.html
• The terms and conditions of the generative AI tool do not place any restrictions on use of the output that would be inconsistent with the Open Source Definition. https://opensource.org/osd/
It's very possible I misunderstood, but the legal thread called out OpenAI for similar wording, which caused them to add this specific section to cover it.
>
>> The ownership of the content generated based on Kimi is maintained by you, and you are responsible for its independent judgment and use. Any intellectual property issues arising from the use of content generated by Kimi are handled by you, and we are not responsible for any losses caused thereby. If you cause any loss to us, we have the right to recover from you.
> Is there any concern about the transference of "cause any loss to us, we have the right to recover from you."?
>
> Regarding OpenAI's terms, what aspects of them are problematic for ASF donation? Reading through that now:
>> Use Output to develop models that compete with OpenAI.
>
> I don't think that someone in the future using Cassandra with code generated by OpenAI would qualify. i.e. this can't be transitive, else it'd poison SBOMs everywhere w/dependency chains.
>
>> Ownership of content. As between you and OpenAI, and to the extent permitted by applicable law, you (a) retain your ownership rights in Input and (b) own the Output. We hereby assign to you all our right, title, and interest, if any, in and to Output.
> Seems compatible?
Ownership isn’t the concern; many patches would not be inventing new data structures and would likely be gluing Cassandra code together, so wouldn’t be copyrightable (there may be cases, but we should treat those in isolation as the same concern happens with human code).
OpenAI called out that you can’t use their output to build something to compete with them, which is what was flagged as being against the “Open Source Definition”.
>
> On Fri, Aug 1, 2025, at 6:22 AM, David Capwell wrote:
>>
>>
>> Not really, this thread has made me really see that we need to know the tool/model provider so we can confirm the T&C allow contributions.
>>
>> OpenAI is not allowed and we know most popular ones are, but what about new ones? Kimi K2 has similar wording as OpenAI so I assume they are banned as well?
>>
>> https://kimi.moonshot.cn/user/agreement/modelUse
>>
>> I don’t know of any tool that currently is excluded, it’s only been model providers so far… but this is moving fast so we should also know the tool
>>
>> The best compromise is having an auto-approve allow list; if on that list, don’t need to disclose?
>>
>> Sent from my iPhone
>>
>>> On Jul 31, 2025, at 9:13 PM, Yifan Cai <[email protected]> wrote:
>>>
>>> Does "optionally disclose the LLM used in whatever way you prefer and definitely no OpenAI" meet everyone's expectations?
>>>
>>> - Yifan
>>>
>>> On Thu, Jul 31, 2025 at 1:56 PM Josh McKenzie <[email protected]> wrote:
>>>
>>> Do we have a consensus on this topic or is there still further discussion to be had?
>>>
>>> On Thu, Jul 24, 2025, at 8:26 AM, David Capwell wrote:
>>>>
>>>>
>>>> Given the above, code generated in whole or in part using AI can be contributed if the contributor ensures that:
>>>>
>>>> 1. The terms and conditions of the generative AI tool do not place any restrictions on use of the output that would be inconsistent with the Open Source Definition.
>>>> 2. At least one of the following conditions is met:
>>>>    2.1. The output is not copyrightable subject matter (and would not be even if produced by a human).
>>>>    2.2. No third party materials are included in the output.
>>>>    2.3. Any third party materials that are included in the output are being used with permission (e.g., under a compatible open-source license) of the third party copyright holders and in compliance with the applicable license terms.
>>>> 3. A contributor obtains reasonable certainty that conditions 2.2 or 2.3 are met if the AI tool itself provides sufficient information about output that may be similar to training data, or from code scanning results
>>>> ASF Generative Tooling Guidance
>>>> apache.org
>>>>
>>>>
>>>> Ariel shared this at the start. Right now we must know what tool was used so we can make sure its license is ok. The only tool currently flagged as not acceptable is OpenAI, as it has wording limiting what you may do with its output.
>>>>
>>>> Sent from my iPhone
>>>>
>>>>> On Jul 23, 2025, at 1:31 PM, Jon Haddad <[email protected]> wrote:
>>>>>
>>>>> +1 to Patrick's proposal.
>>>>>
>>>>> On Wed, Jul 23, 2025 at 12:37 PM Patrick McFadin <[email protected]> wrote:
>>>>> I just did some review on all the case law around copyright and AI code. So far, every claim has been dismissed. There are some other cases like NYTimes which have more merit and are proceeding.
>>>>>
>>>>> Which leads me to the opinion that this is feeling like a premature optimization. Somebody creating a PR should not have to also submit an SBOM, which is essentially what we’re asking. It’s undue burden and friction on the process when we should be looking for ways to reduce friction.
>>>>>
>>>>> My proposal is no disclosures required.
>>>>>
>>>>> On Wed, Jul 23, 2025 at 12:06 PM Yifan Cai <[email protected]> wrote:
>>>>> According to the thread, the disclosure is for legal purposes. For example, the patch is not produced by OpenAI's service. I think having the discussion to clarify the AI usage in the projects is meaningful. I guess many are hesitating because of the lack of clarity in the area.
>>>>>
>>>>> > I don’t believe or agree with us assuming we should do this for every PR
>>>>>
>>>>> I am with you, David. Updating the mail list for PRs is overwhelming for both the author and the community.
>>>>>
>>>>> I also do not feel co-author is the best place.
>>>>>
>>>>> - Yifan
>>>>>
>>>>> On Wed, Jul 23, 2025 at 11:51 AM Patrick McFadin <[email protected]> wrote:
>>>>> This is starting to get ridiculous. Disclosure statements on exactly how a problem was solved? What’s next? Time cards?
>>>>>
>>>>> It’s time to accept the world as it is. AI is in the coding toolbox now just like IDEs, linters and code formatters. Some may not like using them, some may love using them. What matters is that a problem was solved and the code matches whatever quality standard the project upholds, which should be enforced by testing and code reviews.
>>>>>
>>>>> Patrick
>>>>>
>>>>> On Wed, Jul 23, 2025 at 11:31 AM David Capwell <[email protected]> wrote:
>>>>>>
>>>>>>
>>>>>> David is disclosing it in the maillist and the GH page. Should the disclosure be persisted in the commit?
>>>>>
>>>>>
>>>>> Someone asked me to update the ML, but I don’t believe or agree with us assuming we should do this for every PR; personally storing this in the PR description is fine to me as you are telling the reviewers (who you need to communicate this to).
>>>>>
>>>>>
>>>>>> I’d say we can use the co-authored part of our commit messages to disclose the actual AI that was used?
>>>>>
>>>>>
>>>>> Heh... I kinda feel dirty doing that… No one does that when they take something from a blog or stack overflow, but when you do that you should still attribute by linking… which I guess is what Co-Authored does?
>>>>>
>>>>> I don’t know… feels dirty...
>>>>>
>>>>>
>>>>>> On Jul 23, 2025, at 11:19 AM, Bernardo Botella <[email protected]> wrote:
>>>>>>
>>>>>> That’s a great point. I’d say we can use the co-authored part of our commit messages to disclose the actual AI that was used?
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On Jul 23, 2025, at 10:57 AM, Yifan Cai <[email protected]> wrote:
>>>>>>>
>>>>>>> Curious, what are the good ways to disclose the information?
>>>>>>>
>>>>>>> > All of which comes back to: if people disclose if they used AI, what models, and whether they used the code or text the model wrote verbatim or used it as a scaffolding and then heavily modified everything I think we'll be in a pretty good spot.
>>>>>>>
>>>>>>> David is disclosing it in the maillist and the GH page. Should the disclosure be persisted in the commit?
>>>>>>>
>>>>>>> - Yifan
>>>>>>>
>>>>>>> On Wed, Jul 23, 2025 at 8:47 AM David Capwell <[email protected]> wrote:
>>>>>>> Sent out this patch that was written 100% by Claude: https://github.com/apache/cassandra/pull/4266
>>>>>>>
>>>>>>> Claude's license doesn’t have issues with the current ASF policy as far as I can tell. If you look at the patch it’s very clear there isn’t any copyrighted material (it's gluing together C* classes).
>>>>>>>
>>>>>>> I could have written this myself but I had to focus on code reviews and also needed this patch out, so asked Claude to write it for me so I could focus on reviews. I have reviewed it myself and it’s basically the same code I would have written (notice how small and focused the patch is, larger stuff doesn’t normally pass my peer review).
>>>>>>>
>>>>>>>> On Jun 25, 2025, at 2:37 PM, David Capwell <[email protected]> wrote:
>>>>>>>>
>>>>>>>> +1 to what Josh said
>>>>>>>> Sent from my iPhone
>>>>>>>>
>>>>>>>>> On Jun 25, 2025, at 1:18 PM, Josh McKenzie <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Did some more digging. Apparently the way a lot of headline-grabbers have been making models reproduce code verbatim is to prompt them with dozens of verbatim tokens of copyrighted code as input where completion is then very heavily weighted to regurgitate the initial implementation. Which makes sense; if you copy/paste 100 lines of copyrighted code, the statistically likely completion for that will be that initial implementation.
>>>>>>>>>
>>>>>>>>> For local LLMs, verbatim reproduction is unlikely for a different reason, but apparently comparably so: they have far fewer parameters (32B vs. 671B for DeepSeek, for instance) relative to a pre-training corpus of trillions of tokens (30T in the case of Qwen3-32B, for instance), so the individual tokens from the copyrighted material are highly unlikely to be actually stored in the model to be reproduced, and certainly not in sequence. They don't have the post-generation checks claimed by the SOTA models, but are apparently considered in the "< 1 in 10,000 completions will generate copyrighted code" territory.
>>>>>>>>>
>>>>>>>>> When given a human language prompt, or a multi-agent pipelined "still human language but from your architect agent" prompt, the likelihood of producing a string of copyrighted code in that manner is statistically very, very low. I think we're at far more risk of contributors copy/pasting stack overflow or code from other projects than we are from modern genAI models producing blocks of copyrighted code.
>>>>>>>>>
>>>>>>>>> All of which comes back to: if people disclose if they used AI, what models, and whether they used the code or text the model wrote verbatim or used it as a scaffolding and then heavily modified everything I think we'll be in a pretty good spot.
>>>>>>>>>
>>>>>>>>> On Wed, Jun 25, 2025, at 12:47 PM, David Capwell wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> 2. Models that do not do output filtering to restrict the reproduction of training data unless the tool can ensure the output is license compatible?
>>>>>>>>>>>
>>>>>>>>>>> 2 would basically prohibit locally run models.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I am not for this for the reasons listed above. There isn’t a difference between this and a contributor copying code and sending it our way. We still need to validate the code can be accepted.
>>>>>>>>>>
>>>>>>>>>> We also have the issue of having this be a broad stroke. If the user asked a model to write a test for the code the human wrote, we reject the contribution as they used a local model? This poses very little copyright risk yet our policy would now reject it
>>>>>>>>>>
>>>>>>>>>> Sent from my iPhone
>>>>>>>>>>
>>>>>>>>>>> On Jun 25, 2025, at 9:10 AM, Ariel Weisberg <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> 2. Models that do not do output filtering to restrict the reproduction of training data unless the tool can ensure the output is license compatible?
>>>>>>>>>>>
>>>>>>>>>>> 2 would basically prohibit locally run models.
>>>>>>>>>
>>>>>>>>>
>>>
>
