Effectively, every online service will restrict competitors from using its resources, since it pays for infrastructure and energy to make a profit, not to make the competition stronger. Given the rumors or evidence (I'm not clear on which) that DeepSeek was trained on some sort of OpenAI resources, I'm not surprised that suppliers have become super careful in this regard. It seems that so far the safest way to avoid trouble with the license terms on results is running an on-premises LLM. In that case all the infrastructure costs are on the donor who runs it to produce patches, and the model itself remains a tool similar to an IDE or a generator. The publisher of the model will need to provide a permissive license for the results, thus giving the contributor freedom of use.

Given all of the above, I doubt that we (at the ASF) are able to effectively control and restrict this. If somebody decides to hide information about how a patch was actually produced, we have no way to detect it. It's hard with code produced by people; it's equally hard or even harder with LLM-generated code. There are currently no ways to spot LLM results, and we will probably wait quite a long time for tools that do this for us. We do rely on trust in contributors, that they act in good faith, as they will benefit from submitting fixes/features and continuing to use the project. Disclosure remains the contributor's duty; however, maintaining a list of compatible tools probably goes beyond this project or even the whole ASF. The problem will affect everyone in the industry. Similar to software licenses, we will end up with tools which are compatible with specific requirements (e.g. those of the OSI). An additional factor the open source community will need to track will be changes in Terms & Conditions.
So far OSI provides only this guidance: https://opensource.org/ai

Best,
Łukasz


On 8/1/25 21:13, Josh McKenzie wrote:
So I'll go ahead and preface this email - I'm not trying to open Pandora's Box or re-litigate settled things from the thread. /But.../

        • The terms and conditions of the generative AI tool do not place any restrictions on use of the output that would be inconsistent with the Open Source Definition. https://opensource.org/osd/
By that logic, Anthropic's terms would also run afoul of that right?
https://www.anthropic.com/legal/consumer-terms
You may not access or use, or help another person to access or use, our Services in the following ways:
...
2. To develop any products or services that compete with our Services, including to develop or train any artificial intelligence or machine learning algorithms or models or resell the Services.
...

Strictly speaking, that collides with the open source definition: https://opensource.org/osd


      6. No Discrimination Against Fields of Endeavor

The license must not restrict anyone from making use of the program in a specific field of endeavor. For example, it may not restrict the program from being used in a business, or from being used for genetic research.

Which is going to hold true for basically all AI platforms. At least right now, they all have some form of restriction and verbiage discouraging using their services to build competing services.

Gemini, similar terms <https://ai.google.dev/gemini-api/terms>:
You may not use the Services to develop models that compete with the Services (e.g., Gemini API or Google AI Studio). You also may not attempt to reverse engineer, extract or replicate any component of the Services, including the underlying data or models (e.g., parameter weights).
Plus a prohibited use clause.

So ISTM we should either be ok with all of them (i.e. Cassandra doesn't compete with any of them, and it matches the definition of open source in the context of our project's usage) or ok with none of them. And I'm heavily in favor of the former interpretation.

On Fri, Aug 1, 2025, at 11:48 AM, David Capwell wrote:


> On Aug 1, 2025, at 6:38 AM, Josh McKenzie <jmcken...@apache.org> wrote:
>
>> Kimi K2 has similar wording as OpenAI so I assume they are banned as well?
>
> What about the terms is incompatible with the ASF? Looks like you're good to go with whatever you generate?


Go to the “Service Misuse” section

    • By using Kimi to develop, train, or improve algorithms, models, etc., that are in direct or indirect competition with us;


This is the type of wording that caused OpenAI to be excluded.

In https://www.apache.org/legal/generative-tooling.html

        • The terms and conditions of the generative AI tool do not place any restrictions on use of the output that would be inconsistent with the Open Source Definition. https://opensource.org/osd/

It's very possible I misunderstood, but the legal thread called out OpenAI for similar wording, which caused them to add this specific section to cover it.


>
>> The ownership of the content generated based on Kimi is maintained by you, and you are responsible for its independent judgment and use. Any intellectual property issues arising from the use of content generated by Kimi are handled by you, and we are not responsible for any losses caused thereby. If you cause any loss to us, we have the right to recover from you.
> Is there any concern about the transference of "cause any loss to us, we have the right to recover from you."?
>
> Regarding OpenAI's terms, what aspects of them are problematic for ASF donation? Reading through that now:
>> Use Output to develop models that compete with OpenAI.
>
> I don't think that someone in the future using Cassandra with code generated by OpenAI would qualify. i.e. this can't be transitive else it'd poison SBOM's everywhere w/dependency chains.
>
>> Ownership of content. As between you and OpenAI, and to the extent permitted by applicable law, you (a) retain your ownership rights in Input and (b) own the Output. We hereby assign to you all our right, title, and interest, if any, in and to Output.
> Seems compatible?

Ownership isn’t the concern; many patches would not be inventing new data structures and would likely be gluing Cassandra code together, so they wouldn’t be copyrightable (there may be cases, but we should treat those in isolation, as the same concern arises with human code).

OpenAI called out that you can’t use their output to build something to compete with them, which is what was flagged as being against the “Open Source Definition”.

>
> On Fri, Aug 1, 2025, at 6:22 AM, David Capwell wrote:
>>
>>
>> Not really, this thread has made me really see that we need to know the tool/model provider so we can confirm the T&Cs allow contributions.
>>
>> OpenAI is not allowed and we know most popular ones are, but what about new ones?  Kimi K2 has similar wording as OpenAI so I assume they are banned as well?
>>
>> https://kimi.moonshot.cn/user/agreement/modelUse
>>
>> I don’t know of any tool that is currently excluded, it’s only been model providers so far… but this is moving fast, so we should also know the tool
>>
>> The best compromise is having an auto-approve allow list; if a tool is on that list, there's no need to disclose?
>>
>> Sent from my iPhone
>>
>>> On Jul 31, 2025, at 9:13 PM, Yifan Cai <yc25c...@gmail.com> wrote:
>>>
>>> Does "optionally disclose the LLM used in whatever way you prefer and definitely no OpenAI" meet everyone's expectations?
>>>
>>> - Yifan
>>>
>>> On Thu, Jul 31, 2025 at 1:56 PM Josh McKenzie <jmcken...@apache.org> wrote:
>>>
>>> Do we have a consensus on this topic or is there still further discussion to be had?
>>>
>>> On Thu, Jul 24, 2025, at 8:26 AM, David Capwell wrote:
>>>>
>>>>
>>>> Given the above, code generated in whole or in part using AI can be contributed if the contributor ensures that:
>>>>
>>>> 1. The terms and conditions of the generative AI tool do not place any restrictions on use of the output that would be inconsistent with the Open Source Definition.
>>>> 2. At least one of the following conditions is met:
>>>>    2.1. The output is not copyrightable subject matter (and would not be even if produced by a human).
>>>>    2.2. No third party materials are included in the output.
>>>>    2.3. Any third party materials that are included in the output are being used with permission (e.g., under a compatible open-source license) of the third party copyright holders and in compliance with the applicable license terms.
>>>> 3. A contributor obtains reasonable certainty that conditions 2.2 or 2.3 are met if the AI tool itself provides sufficient information about output that may be similar to training data, or from code scanning results.
>>>>
>>>> (ASF Generative Tooling Guidance, apache.org)
>>>>
>>>>
>>>> Ariel shared this at the start.  Right now we must know what tool was used so we can make sure its license is ok.  The only tool currently flagged as not acceptable is OpenAI, as it has wording limiting what you may do with its output.
>>>>
>>>> Sent from my iPhone
>>>>
>>>>> On Jul 23, 2025, at 1:31 PM, Jon Haddad <j...@rustyrazorblade.com> wrote:
>>>>>
>>>>> +1 to Patrick's proposal.
>>>>>
>>>>> On Wed, Jul 23, 2025 at 12:37 PM Patrick McFadin <pmcfa...@gmail.com> wrote:
>>>>> I just did some review of all the case law around copyright and AI code. So far, every claim has been dismissed. There are some other cases like NYTimes which have more merit and are proceeding.
>>>>>
>>>>> Which leads me to the opinion that this is feeling like a premature optimization. Somebody creating a PR should not have to also submit a SBOM, which is essentially what we’re asking. It’s undue burden and friction on the process when we should be looking for ways to reduce friction.
>>>>>
>>>>> My proposal is no disclosures required.
>>>>>
>>>>> On Wed, Jul 23, 2025 at 12:06 PM Yifan Cai <yc25c...@gmail.com> wrote:
>>>>> According to the thread, the disclosure is for legal purposes. For example, to confirm that the patch was not produced by OpenAI's service. I think having the discussion to clarify AI usage in the projects is meaningful. I guess many are hesitating because of the lack of clarity in the area.
>>>>>
>>>>> > I don’t believe or agree with us assuming we should do this for every PR
>>>>>
>>>>> I am with you, David. Updating the mail list for PRs is overwhelming for both the author and the community.
>>>>>
>>>>> I also do not feel co-author is the best place.
>>>>>
>>>>> - Yifan
>>>>>
>>>>> On Wed, Jul 23, 2025 at 11:51 AM Patrick McFadin <pmcfa...@gmail.com> wrote:
>>>>> This is starting to get ridiculous. Disclosure statements on exactly how a problem was solved? What’s next? Time cards?
>>>>>
>>>>> It’s time to accept the world as it is. AI is in the coding toolbox now just like IDEs, linters and code formatters. Some may not like using them, some may love using them. What matters is that a problem was solved, the code matches whatever quality standard the project upholds which should be enforced by testing and code reviews.
>>>>>
>>>>> Patrick
>>>>>
>>>>> On Wed, Jul 23, 2025 at 11:31 AM David Capwell <dcapw...@apple.com> wrote:
>>>>>>
>>>>>>
>>>>>> David is disclosing it in the maillist and the GH page. Should the disclosure be persisted in the commit?
>>>>>
>>>>>
>>>>> Someone asked me to update the ML, but I don’t believe or agree with us assuming we should do this for every PR; personally storing this in the PR description is fine to me as you are telling the reviewers (who you need to communicate this to).
>>>>>
>>>>>
>>>>>> I’d say we can use the co-authored part of our commit messages to disclose the actual AI that was used?
>>>>>
>>>>>
>>>>> Heh... I kinda feel dirty doing that… No one does that when they take something from a blog or stack overflow, but when you do that you should still attribute by linking… which I guess is what Co-Authored does?
>>>>>
>>>>> I don’t know… feels dirty...
>>>>>
>>>>>
>>>>>> On Jul 23, 2025, at 11:19 AM, Bernardo Botella <conta...@bernardobotella.com> wrote:
>>>>>>
>>>>>> That’s a great point. I’d say we can use the co-authored part of our commit messages to disclose the actual AI that was used?
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On Jul 23, 2025, at 10:57 AM, Yifan Cai <yc25c...@gmail.com> wrote:
>>>>>>>
>>>>>>> Curious, what are the good ways to disclose the information?
>>>>>>>
>>>>>>> > All of which comes back to: if people disclose if they used AI, what models, and whether they used the code or text the model wrote verbatim or used it as a scaffolding and then heavily modified everything I think we'll be in a pretty good spot.
>>>>>>>
>>>>>>> David is disclosing it in the maillist and the GH page. Should the disclosure be persisted in the commit?
>>>>>>>
>>>>>>> - Yifan
>>>>>>>
>>>>>>> On Wed, Jul 23, 2025 at 8:47 AM David Capwell <dcapw...@apple.com> wrote:
>>>>>>> Sent out this patch that was written 100% by Claude: https://github.com/apache/cassandra/pull/4266
>>>>>>>
>>>>>>> Claude's license doesn’t have issues with the current ASF policy as far as I can tell.  If you look at the patch it’s very clear there isn’t any copyrighted material (it's gluing together C* classes).
>>>>>>>
>>>>>>> I could have written this myself but I had to focus on code reviews and also needed this patch out, so I asked Claude to write it for me so I could focus on reviews.  I have reviewed it myself and it’s basically the same code I would have written (notice how small and focused the patch is, larger stuff doesn’t normally pass my peer review).
>>>>>>>
>>>>>>>> On Jun 25, 2025, at 2:37 PM, David Capwell <dcapw...@apple.com> wrote:
>>>>>>>>
>>>>>>>> +1 to what Josh said
>>>>>>>> Sent from my iPhone
>>>>>>>>
>>>>>>>>> On Jun 25, 2025, at 1:18 PM, Josh McKenzie <jmcken...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>> Did some more digging. Apparently the way a lot of headline-grabbers have been making models reproduce code verbatim is to prompt them with dozens of verbatim tokens of copyrighted code as input where completion is then very heavily weighted to regurgitate the initial implementation. Which makes sense; if you copy/paste 100 lines of copyrighted code, the statistically likely completion for that will be that initial implementation.
>>>>>>>>>
>>>>>>>>> For local LLMs, verbatim reproduction is unlikely for a different reason, though apparently to a comparable degree: they have far fewer parameters (32B vs. 671B for DeepSeek, for instance) relative to a pre-training corpus of trillions of tokens (30T in the case of Qwen3-32B, for instance), so the individual tokens from the copyrighted material are highly unlikely to be actually stored in the model to be reproduced, and certainly not in sequence. They don't have the post-generation checks claimed by the SOTA models, but are apparently considered in the "< 1 in 10,000 completions will generate copyrighted code" territory.
>>>>>>>>>
>>>>>>>>> When asked a human language prompt, or a multi-agent pipelined "still human language but from your architect agent" prompt, the likelihood of producing a string of copyrighted code in that manner is statistically very, very low. I think we're at far more risk of contributors copy/pasting stack overflow or code from other projects than we are from modern genAI models producing blocks of copyrighted code.
>>>>>>>>>
>>>>>>>>> All of which comes back to: if people disclose if they used AI, what models, and whether they used the code or text the model wrote verbatim or used it as a scaffolding and then heavily modified everything I think we'll be in a pretty good spot.
>>>>>>>>>
>>>>>>>>> On Wed, Jun 25, 2025, at 12:47 PM, David Capwell wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> 2. Models that do not do output filtering to restrict the reproduction of training data unless the tool can ensure the output is license compatible?
>>>>>>>>>>>
>>>>>>>>>>> 2 would basically prohibit locally run models.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I am not for this for the reasons listed above. There isn’t a difference between this and a contributor copying code and sending it our way. We still need to validate that the code can be accepted.
>>>>>>>>>>
>>>>>>>>>> We also have the issue of this being a broad stroke. If the user asked a model to write a test for code the human wrote, do we reject the contribution because they used a local model? This poses very little copyright risk, yet our policy would now reject it.
>>>>>>>>>>
>>>>>>>>>> Sent from my iPhone
>>>>>>>>>>
>>>>>>>>>>> On Jun 25, 2025, at 9:10 AM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>>>>>>>>>>>
>>>>>>>>>>> 2. Models that do not do output filtering to restrict the reproduction of training data unless the tool can ensure the output is license compatible?
>>>>>>>>>>>
>>>>>>>>>>> 2 would basically prohibit locally run models.
>>>>>>>>>
>>>>>>>>>
>>>
>



