Re: Accepting AI generated contributions

Josh McKenzie Fri, 01 Aug 2025 10:14:25 -0700

So I'll go ahead and preface this email - I'm not trying to open Pandora's Box 
or re-litigate settled things from the thread. *But...*


>         • The terms and conditions of the generative AI tool do not place any 
> restrictions on use of the output that would be inconsistent with the Open 
> Source Definition. https://opensource.org/osd/
By that logic, Anthropic's terms would also run afoul of that, right?
https://www.anthropic.com/legal/consumer-terms
> You may not access or use, or help another person to access or use, our 
> Services in the following ways:
> ...
> 2. To develop any products or services that compete with our Services, 
> including to develop or train any artificial intelligence or machine learning 
> algorithms or models or resell the Services.
> ...

Strictly speaking, that collides with the open source definition: 
https://opensource.org/osd
> 6. No Discrimination Against Fields of Endeavor
> 
> The license must not restrict anyone from making use of the program in a 
> specific field of endeavor. For example, it may not restrict the program from 
> being used in a business, or from being used for genetic research.

Which is going to hold true for basically all AI platforms. At least right now, 
they all have some form of restriction and verbiage discouraging using their 
services to build competing services.

Gemini, similar terms <https://ai.google.dev/gemini-api/terms>:
> You may not use the Services to develop models that compete with the Services 
> (e.g., Gemini API or Google AI Studio). You also may not attempt to reverse 
> engineer, extract or replicate any component of the Services, including the 
> underlying data or models (e.g., parameter weights).
Plus a prohibited use clause.

So ISTM we should either be ok with all of them (i.e. Cassandra doesn't compete 
with any of them and it matches the definition of open-source in the context of 
our project's usage) or ok with none of them. And I'm heavily in favor of the 
former interpretation.

On Fri, Aug 1, 2025, at 11:48 AM, David Capwell wrote:
> 
> 
> > On Aug 1, 2025, at 6:38 AM, Josh McKenzie <[email protected]> wrote:
> > 
> >> Kimi K2 has similar wording as OpenAI so I assume they are banned as well? 
> > 
> > What about the terms is incompatible with the ASF? Looks like you're good 
> > to go with whatever you generate?
> 
> 
> Go to the "Service Misuse" section
> 
>     • By using Kimi to develop, train, or improve algorithms, models, etc., 
> that are in direct or indirect competition with us;
> 
> 
> This is the type of wording that caused OpenAI to be excluded. 
> 
> In https://www.apache.org/legal/generative-tooling.html
> 
>         • The terms and conditions of the generative AI tool do not place any 
> restrictions on use of the output that would be inconsistent with the Open 
> Source Definition. https://opensource.org/osd/
> 
> It's very possible I misunderstood, but the legal thread called out OpenAI for 
> similar wording which caused them to add this specific section to cover.
> 
> 
> > 
> >> The ownership of the content generated based on Kimi is maintained by you, 
> >> and you are responsible for its independent judgment and use. Any 
> >> intellectual property issues arising from the use of content generated by 
> >> Kimi are handled by you, and we are not responsible for any losses caused 
> >> thereby. If you cause any loss to us, we have the right to recover from 
> >> you.
> > Is there any concern about the transference of "cause any loss to us, we 
> > have the right to recover from you."?
> > 
> > Regarding OpenAI's terms, what aspects of them are problematic for ASF 
> > donation? Reading through that now:
> >> Use Output to develop models that compete with OpenAI.
> > 
> > I don't think that someone in the future using Cassandra with code 
> > generated by OpenAI would qualify. i.e. this can't be transitive else it'd 
> > > poison SBOMs everywhere w/dependency chains.
> > 
> >> Ownership of content. As between you and OpenAI, and to the extent 
> >> permitted by applicable law, you (a) retain your ownership rights in Input 
> >> and (b) own the Output. We hereby assign to you all our right, title, and 
> >> interest, if any, in and to Output. 
> > Seems compatible?
> 
> Ownership isn’t the concern; many patches would not be inventing new data 
> structures and would likely just be gluing Cassandra code together, so they 
> wouldn’t be copyrightable (there may be cases, but we should treat those in 
> isolation as the same concern happens with human code).
> 
> OpenAI called out that you can’t use their output to build something to 
> compete with them, which is what was flagged as being against the “Open 
> Source Definition”.
> 
> > 
> > On Fri, Aug 1, 2025, at 6:22 AM, David Capwell wrote:
> >> 
> >> 
> >> Not really, this thread has made me really see that we need to know the 
> >> tool/model provider so we can confirm the T&C allows contributions.
> >> 
> >> OpenAI is not allowed and we know most popular ones are, but what about 
> >> new ones?  Kimi K2 has similar wording as OpenAI so I assume they are 
> >> banned as well? 
> >> 
> >> https://kimi.moonshot.cn/user/agreement/modelUse
> >> 
> >> I don’t know of any tool that is currently excluded; it’s only been model 
> >> providers so far… but this is moving fast, so we should also know the tool.
> >> 
> >> The best compromise is having an auto-approve allow list: if a tool is on 
> >> that list, there’s no need to disclose?
> >> 
> >> Sent from my iPhone
> >> 
> >>> On Jul 31, 2025, at 9:13 PM, Yifan Cai <[email protected]> wrote:
> >>> 
> >>> Does "optionally disclose the LLM used in whatever way you prefer and 
> >>> definitely no OpenAI" meet everyone's expectations?
> >>> 
> >>> - Yifan
> >>> 
> >>> On Thu, Jul 31, 2025 at 1:56 PM Josh McKenzie <[email protected]> 
> >>> wrote:
> >>> 
> >>> Do we have a consensus on this topic or is there still further discussion 
> >>> to be had?
> >>> 
> >>> On Thu, Jul 24, 2025, at 8:26 AM, David Capwell wrote:
> >>>> 
> >>>> 
> >>>> Given the above, code generated in whole or in part using AI can be 
> >>>> contributed if the contributor ensures that:
> >>>> 
> >>>> 1. The terms and conditions of the generative AI tool do not place any 
> >>>>    restrictions on use of the output that would be inconsistent with the 
> >>>>    Open Source Definition.
> >>>> 2. At least one of the following conditions is met:
> >>>>    2.1. The output is not copyrightable subject matter (and would not be 
> >>>>         even if produced by a human).
> >>>>    2.2. No third party materials are included in the output.
> >>>>    2.3. Any third party materials that are included in the output are 
> >>>>         being used with permission (e.g., under a compatible open-source 
> >>>>         license) of the third party copyright holders and in compliance 
> >>>>         with the applicable license terms.
> >>>> 3. A contributor obtains reasonable certainty that conditions 2.2 or 2.3 
> >>>>    are met if the AI tool itself provides sufficient information about 
> >>>>    output that may be similar to training data, or from code scanning 
> >>>>    results.
> >>>> 
> >>>> ASF Generative Tooling Guidance
> >>>> apache.org
> >>>> 
> >>>> 
> >>>> Ariel shared this at the start.  Right now we must know what tool was 
> >>>> used so we can make sure its license is ok.  The only tool currently 
> >>>> flagged as not acceptable is OpenAI as it has wording limiting what you 
> >>>> may do with its output.
> >>>> 
> >>>> Sent from my iPhone
> >>>> 
> >>>>> On Jul 23, 2025, at 1:31 PM, Jon Haddad <[email protected]> 
> >>>>> wrote:
> >>>>> 
> >>>>> +1 to Patrick's proposal.
> >>>>> 
> >>>>> On Wed, Jul 23, 2025 at 12:37 PM Patrick McFadin <[email protected]> 
> >>>>> wrote:
> >>>>> I just did some review on all the case law around copyright and AI 
> >>>>> code. So far, every claim has been dismissed. There are some other 
> >>>>> cases like NYTimes which have more merit and are proceeding. 
> >>>>> 
> >>>>> Which leads me to the opinion that this is feeling like a premature 
> >>>>> optimization. Somebody creating a PR should not have to also submit a 
> >>>>> SBOM, which is essentially what we’re asking. It’s undue burden and 
> >>>>> friction on the process when we should be looking for ways to reduce 
> >>>>> friction. 
> >>>>> 
> >>>>> My proposal is no disclosures required. 
> >>>>> 
> >>>>> On Wed, Jul 23, 2025 at 12:06 PM Yifan Cai <[email protected]> wrote:
> >>>>> According to the thread, the disclosure is for legal purposes. For 
> >>>>> example, the patch is not produced by OpenAI's service. I think having 
> >>>>> the discussion to clarify the AI usage in the projects is meaningful. I 
> >>>>> guess many are hesitating because of the lack of clarity in the area. 
> >>>>> 
> >>>>> > I don’t believe or agree with us assuming we should do this for every 
> >>>>> > PR
> >>>>> 
> >>>>> I am with you, David. Updating the mail list for PRs is overwhelming 
> >>>>> for both the author and the community. 
> >>>>> 
> >>>>> I also do not feel co-author is the best place. 
> >>>>> 
> >>>>> - Yifan
> >>>>> 
> >>>>> On Wed, Jul 23, 2025 at 11:51 AM Patrick McFadin <[email protected]> 
> >>>>> wrote:
> >>>>> This is starting to get ridiculous. Disclosure statements on exactly 
> >>>>> how a problem was solved? What’s next? Time cards? 
> >>>>> 
> >>>>> It’s time to accept the world as it is. AI is in the coding toolbox now 
> >>>>> just like IDEs, linters and code formatters. Some may not like using 
> >>>>> them, some may love using them. What matters is that a problem was 
> >>>>> solved, the code matches whatever quality standard the project upholds 
> >>>>> which should be enforced by testing and code reviews. 
> >>>>> 
> >>>>> Patrick
> >>>>> 
> >>>>> On Wed, Jul 23, 2025 at 11:31 AM David Capwell <[email protected]> 
> >>>>> wrote:
> >>>>>> 
> >>>>>> 
> >>>>>> David is disclosing it in the maillist and the GH page. Should the 
> >>>>>> disclosure be persisted in the commit? 
> >>>>> 
> >>>>> 
> >>>>> Someone asked me to update the ML, but I don’t believe or agree with us 
> >>>>> assuming we should do this for every PR; personally storing this in the 
> >>>>> PR description is fine to me as you are telling the reviewers (who you 
> >>>>> need to communicate this to).
> >>>>> 
> >>>>> 
> >>>>>> I’d say we can use the co-authored part of our commit messages to 
> >>>>>> disclose the actual AI that was used? 
> >>>>> 
> >>>>> 
> >>>>> Heh... I kinda feel dirty doing that… No one does that when they take 
> >>>>> something from a blog or stack overflow, but when you do that you 
> >>>>> should still attribute by linking… which I guess is what Co-Authored 
> >>>>> does?
> >>>>> 
> >>>>> I don’t know… feels dirty...
> >>>>> 
> >>>>> 
> >>>>>> On Jul 23, 2025, at 11:19 AM, Bernardo Botella 
> >>>>>> <[email protected]> wrote:
> >>>>>> 
> >>>>>> That’s a great point. I’d say we can use the co-authored part of our 
> >>>>>> commit messages to disclose the actual AI that was used? 
> >>>>>> 
> >>>>>> 
> >>>>>> 
> >>>>>>> On Jul 23, 2025, at 10:57 AM, Yifan Cai <[email protected]> wrote:
> >>>>>>> 
> >>>>>>> Curious, what are the good ways to disclose the information? 
> >>>>>>> 
> >>>>>>> > All of which comes back to: if people disclose if they used AI, 
> >>>>>>> > what models, and whether they used the code or text the model wrote 
> >>>>>>> > verbatim or used it as a scaffolding and then heavily modified 
> >>>>>>> > everything I think we'll be in a pretty good spot.
> >>>>>>> 
> >>>>>>> David is disclosing it in the maillist and the GH page. Should the 
> >>>>>>> disclosure be persisted in the commit? 
> >>>>>>> 
> >>>>>>> - Yifan
> >>>>>>> 
> >>>>>>> On Wed, Jul 23, 2025 at 8:47 AM David Capwell <[email protected]> 
> >>>>>>> wrote:
> >>>>>>> Sent out this patch that was written 100% by Claude: 
> >>>>>>> https://github.com/apache/cassandra/pull/4266
> >>>>>>> 
> >>>>>>> Claude’s license doesn’t have issues with the current ASF policy as 
> >>>>>>> far as I can tell.  If you look at the patch it’s very clear there 
> >>>>>>> isn’t any copyrighted material (it’s gluing together C* classes).
> >>>>>>> 
> >>>>>>> I could have written this myself but I had to focus on code reviews 
> >>>>>>> and also needed this patch out, so asked Claude to write it for me so 
> >>>>>>> I could focus on reviews.  I have reviewed it myself and it’s 
> >>>>>>> basically the same code I would have written (notice how small and 
> >>>>>>> focused the patch is, larger stuff doesn’t normally pass my peer 
> >>>>>>> review).
> >>>>>>> 
> >>>>>>>> On Jun 25, 2025, at 2:37 PM, David Capwell <[email protected]> 
> >>>>>>>> wrote:
> >>>>>>>> 
> >>>>>>>> +1 to what Josh said
> >>>>>>>> Sent from my iPhone
> >>>>>>>> 
> >>>>>>>>> On Jun 25, 2025, at 1:18 PM, Josh McKenzie <[email protected]> 
> >>>>>>>>> wrote:
> >>>>>>>>> 
> >>>>>>>>> Did some more digging. Apparently the way a lot of 
> >>>>>>>>> headline-grabbers have been making models reproduce code verbatim 
> >>>>>>>>> is to prompt them with dozens of verbatim tokens of copyrighted 
> >>>>>>>>> code as input where completion is then very heavily weighted to 
> >>>>>>>>> regurgitate the initial implementation. Which makes sense; if you 
> >>>>>>>>> copy/paste 100 lines of copyrighted code, the statistically likely 
> >>>>>>>>> completion for that will be that initial implementation.
> >>>>>>>>> 
> >>>>>>>>> For local LLMs, verbatim reproduction is apparently comparably 
> >>>>>>>>> unlikely, but for a different reason: they have far fewer 
> >>>>>>>>> parameters (32B vs. 671B for DeepSeek, for instance) relative to a 
> >>>>>>>>> pre-training corpus of trillions of tokens (30T in the case of 
> >>>>>>>>> Qwen3-32B, for instance), so the individual tokens from the 
> >>>>>>>>> copyrighted material are highly unlikely to be actually stored in 
> >>>>>>>>> the model to be reproduced, and certainly not in sequence. They 
> >>>>>>>>> don't have the post-generation checks claimed by the SOTA models, 
> >>>>>>>>> but are apparently considered in the "< 1 in 10,000 completions 
> >>>>>>>>> will generate copyrighted code" territory.
> >>>>>>>>> 
> >>>>>>>>> When asked a human language prompt, or a multi-agent pipelined 
> >>>>>>>>> "still human language but from your architect agent" prompt, the 
> >>>>>>>>> likelihood of producing a string of copyrighted code in that manner 
> >>>>>>>>> is statistically very, very low. I think we're at far more risk of 
> >>>>>>>>> contributors copy/pasting stack overflow or code from other 
> >>>>>>>>> projects than we are from modern genAI models producing blocks of 
> >>>>>>>>> copyrighted code.
> >>>>>>>>> 
> >>>>>>>>> All of which comes back to: if people disclose if they used AI, 
> >>>>>>>>> what models, and whether they used the code or text the model wrote 
> >>>>>>>>> verbatim or used it as a scaffolding and then heavily modified 
> >>>>>>>>> everything I think we'll be in a pretty good spot.
> >>>>>>>>> 
> >>>>>>>>> On Wed, Jun 25, 2025, at 12:47 PM, David Capwell wrote:
> >>>>>>>>>> 
> >>>>>>>>>> 
> >>>>>>>>>> 
> >>>>>>>>>>> 2. Models that do not do output filtering to restrict the 
> >>>>>>>>>>> reproduction of training data unless the tool can ensure the 
> >>>>>>>>>>> output is license compatible?
> >>>>>>>>>>> 
> >>>>>>>>>>> 2 would basically prohibit locally run models.
> >>>>>>>>>> 
> >>>>>>>>>> 
> >>>>>>>>>> I am not for this for the reasons listed above. There isn’t a 
> >>>>>>>>>> difference between this and a contributor copying code and sending 
> >>>>>>>>>> our way. We still need to validate the code can be accepted.
> >>>>>>>>>> 
> >>>>>>>>>> We also have the issue of having this be a broad stroke. If the 
> >>>>>>>>>> user asked a model to write a test for the code the human wrote, 
> >>>>>>>>>> we reject the contribution as they used a local model? This poses 
> >>>>>>>>>> very little copyright risk, yet our policy would now reject it.
> >>>>>>>>>> 
> >>>>>>>>>> Sent from my iPhone
> >>>>>>>>>> 
> >>>>>>>>>>> On Jun 25, 2025, at 9:10 AM, Ariel Weisberg <[email protected]> 
> >>>>>>>>>>> wrote:
> >>>>>>>>>>> 
> >>>>>>>>>>> 2. Models that do not do output filtering to restrict the 
> >>>>>>>>>>> reproduction of training data unless the tool can ensure the 
> >>>>>>>>>>> output is license compatible?
> >>>>>>>>>>> 
> >>>>>>>>>>> 2 would basically prohibit locally run models.
> >>>>>>>>> 
> >>>>>>>>> 
> >>> 
> > 
> 
>
