> On Aug 1, 2025, at 6:38 AM, Josh McKenzie <jmcken...@apache.org> wrote:
>
>> Kimi K2 has wording similar to OpenAI's, so I assume they are banned as well?
>
> What about the terms is incompatible with the ASF? Looks like you're
good to go with whatever you generate?
Go to the "Service Misuse" section:
• By using Kimi to develop, train, or improve algorithms, models,
etc., that are in direct or indirect competition with us;
This is the type of wording that caused OpenAI to be excluded.
In https://www.apache.org/legal/generative-tooling.html
	• The terms and conditions of the generative AI tool do not place any restrictions on use of the output that would be inconsistent with the Open Source Definition. https://opensource.org/osd/
It's very possible I misunderstood, but the legal thread called out OpenAI for similar wording, which is what caused this specific section to be added to cover it.
>
>> The ownership of the content generated based on Kimi is maintained
by you, and you are responsible for its independent judgment and use.
Any intellectual property issues arising from the use of content
generated by Kimi are handled by you, and we are not responsible for
any losses caused thereby. If you cause any loss to us, we have the
right to recover from you.
> Is there any concern about the transference of "cause any loss to us, we have the right to recover from you"?
>
> Regarding OpenAI's terms, what aspects of them are problematic for
ASF donation? Reading through that now:
>> Use Output to develop models that compete with OpenAI.
>
> I don't think that someone in the future using Cassandra with code generated by OpenAI would qualify. I.e., this can't be transitive, else it'd poison SBOMs everywhere w/dependency chains.
>
>> Ownership of content. As between you and OpenAI, and to the extent
permitted by applicable law, you (a) retain your ownership rights in
Input and (b) own the Output. We hereby assign to you all our right,
title, and interest, if any, in and to Output.
> Seems compatible?
Ownership isn't the concern; most patches would not be inventing new data structures and are likely gluing Cassandra code together, so they wouldn't be copyrightable (there may be cases, but we should treat those in isolation, as the same concern happens with human code). OpenAI called out that you can't use their output to build something to compete with them, which is what was flagged as being against the "Open Source Definition".
>
> On Fri, Aug 1, 2025, at 6:22 AM, David Capwell wrote:
>>
>>
>> Not really, this thread has made me really see that we need to know the tool/model provider so we can confirm the terms and conditions allow contributions.
>>
>> OpenAI is not allowed, and we know most popular ones are allowed, but what about new ones? Kimi K2 has wording similar to OpenAI's, so I assume they are banned as well?
>>
>> https://kimi.moonshot.cn/user/agreement/modelUse
>>
>> I don't know of any tool that is currently excluded; it's only been model providers so far… but this is moving fast, so we should also know the tool.
>>
>> The best compromise is having an auto-approve allow list; if the model is on that list, you don't need to disclose?
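>>
>> To make that concrete, a rough sketch of what such an allow list could record, based only on what has come up in this thread (file name and format are hypothetical):
>>
>>     # ai-model-allowlist.txt (hypothetical)
>>     # Providers whose terms have been reviewed against
>>     # https://www.apache.org/legal/generative-tooling.html
>>     Claude (Anthropic)    allowed     # no output-use restrictions found
>>     OpenAI                excluded    # output may not be used to compete
>>     Kimi K2 (Moonshot)    excluded?   # similar competition clause, pending review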
>>
>> Sent from my iPhone
>>
>>> On Jul 31, 2025, at 9:13 PM, Yifan Cai <yc25c...@gmail.com> wrote:
>>>
>>> Does "optionally disclose the LLM used in whatever way you prefer
and definitely no OpenAI" meet everyone's expectations?
>>>
>>> - Yifan
>>>
>>> On Thu, Jul 31, 2025 at 1:56 PM Josh McKenzie <jmcken...@apache.org> wrote:
>>>
>>> Do we have a consensus on this topic or is there still further
discussion to be had?
>>>
>>> On Thu, Jul 24, 2025, at 8:26 AM, David Capwell wrote:
>>>>
>>>>
>>>> Given the above, code generated in whole or in part using AI can be contributed if the contributor ensures that:
>>>> 1. The terms and conditions of the generative AI tool do not place any restrictions on use of the output that would be inconsistent with the Open Source Definition.
>>>> 2. At least one of the following conditions is met:
>>>> 2.1. The output is not copyrightable subject matter (and would not be even if produced by a human).
>>>> 2.2. No third party materials are included in the output.
>>>> 2.3. Any third party materials that are included in the output are being used with permission (e.g., under a compatible open-source license) of the third party copyright holders and in compliance with the applicable license terms.
>>>> 3. A contributor obtains reasonable certainty that conditions 2.2 or 2.3 are met if the AI tool itself provides sufficient information about output that may be similar to training data, or from code scanning results.
>>>> ASF Generative Tooling Guidance (apache.org)
>>>>
>>>>
>>>> Ariel shared this at the start. Right now we must know what tool was used so we can make sure its license is OK. The only tool currently flagged as not acceptable is OpenAI, as it has wording limiting what you may do with its output.
>>>>
>>>> Sent from my iPhone
>>>>
>>>>> On Jul 23, 2025, at 1:31 PM, Jon Haddad <j...@rustyrazorblade.com> wrote:
>>>>>
>>>>> +1 to Patrick's proposal.
>>>>>
>>>>> On Wed, Jul 23, 2025 at 12:37 PM Patrick McFadin <pmcfa...@gmail.com> wrote:
>>>>> I just did some review of all the case law around copyright and AI code. So far, every claim has been dismissed. There are some other cases, like the NYTimes one, which have more merit and are proceeding.
>>>>>
>>>>> Which leads me to the opinion that this is feeling like a premature optimization. Somebody creating a PR should not have to also submit an SBOM, which is essentially what we're asking. It's undue burden and friction on the process when we should be looking for ways to reduce friction.
>>>>>
>>>>> My proposal is no disclosures required.
>>>>>
>>>>> On Wed, Jul 23, 2025 at 12:06 PM Yifan Cai <yc25c...@gmail.com> wrote:
>>>>> According to the thread, the disclosure is for legal purposes, e.g., confirming the patch was not produced by OpenAI's service. I think having the discussion to clarify AI usage in the project is meaningful. I guess many are hesitating because of the lack of clarity in the area.
>>>>>
>>>>> > I don’t believe or agree with us assuming we should do this
for every PR
>>>>>
>>>>> I am with you, David. Updating the mailing list for PRs is overwhelming for both the author and the community.
>>>>>
>>>>> I also do not feel the co-author line is the best place.
>>>>>
>>>>> - Yifan
>>>>>
>>>>> On Wed, Jul 23, 2025 at 11:51 AM Patrick McFadin <pmcfa...@gmail.com> wrote:
>>>>> This is starting to get ridiculous. Disclosure statements on
exactly how a problem was solved? What’s next? Time cards?
>>>>>
>>>>> It's time to accept the world as it is. AI is in the coding toolbox now, just like IDEs, linters, and code formatters. Some may not like using them, some may love using them. What matters is that a problem was solved and that the code matches whatever quality standard the project upholds, which should be enforced by testing and code reviews.
>>>>>
>>>>> Patrick
>>>>>
>>>>> On Wed, Jul 23, 2025 at 11:31 AM David Capwell <dcapw...@apple.com> wrote:
>>>>>>
>>>>>>
>>>>>> David is disclosing it on the mailing list and the GH page. Should the disclosure be persisted in the commit?
>>>>>
>>>>>
>>>>> Someone asked me to update the ML, but I don’t believe or agree with us assuming we should do this for every PR; personally, storing this in the PR description is fine to me, as you are telling the reviewers (who are the ones you need to communicate this to).
>>>>>
>>>>>
>>>>>> I’d say we can use the co-authored part of our commit messages
to disclose the actual AI that was used?
>>>>>
>>>>>
>>>>> Heh... I kinda feel dirty doing that… No one does that when they take something from a blog or Stack Overflow, but when you do you should still attribute by linking… which I guess is what Co-authored-by does?
>>>>>
>>>>> I don’t know… feels dirty...
>>>>>
>>>>>
>>>>>> On Jul 23, 2025, at 11:19 AM, Bernardo Botella <conta...@bernardobotella.com> wrote:
>>>>>>
>>>>>> That’s a great point. I’d say we can use the co-authored part
of our commit messages to disclose the actual AI that was used?
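>>>>>>
>>>>>> Something like this in the commit message, using the standard git trailer (ticket number, summary, and trailer address are illustrative only):
>>>>>>
>>>>>>     CASSANDRA-12345: Fix foo in bar
>>>>>>
>>>>>>     Patch drafted with an LLM; reviewed and edited by the author.
>>>>>>
>>>>>>     Co-authored-by: Claude <noreply@anthropic.com>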
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On Jul 23, 2025, at 10:57 AM, Yifan Cai <yc25c...@gmail.com> wrote:
>>>>>>>
>>>>>>> Curious, what are the good ways to disclose the information?
>>>>>>>
>>>>>>> > All of which comes back to: if people disclose if they used AI, what models, and whether they used the code or text the model wrote verbatim or used it as scaffolding and then heavily modified everything, I think we'll be in a pretty good spot.
>>>>>>>
>>>>>>> David is disclosing it on the mailing list and the GH page. Should the disclosure be persisted in the commit?
>>>>>>>
>>>>>>> - Yifan
>>>>>>>
>>>>>>> On Wed, Jul 23, 2025 at 8:47 AM David Capwell <dcapw...@apple.com> wrote:
>>>>>>> Sent out this patch that was written 100% by Claude: https://github.com/apache/cassandra/pull/4266
>>>>>>>
>>>>>>> Claude's license doesn't have issues with the current ASF policy as far as I can tell. If you look at the patch, it's very clear there isn't any copyrighted material (it's gluing together C* classes).
>>>>>>>
>>>>>>> I could have written this myself, but I had to focus on code reviews and also needed this patch out, so I asked Claude to write it for me so I could focus on reviews. I have reviewed it myself, and it's basically the same code I would have written (notice how small and focused the patch is; larger stuff doesn't normally pass my peer review).
>>>>>>>
>>>>>>>> On Jun 25, 2025, at 2:37 PM, David Capwell <dcapw...@apple.com> wrote:
>>>>>>>>
>>>>>>>> +1 to what Josh said
>>>>>>>> Sent from my iPhone
>>>>>>>>
>>>>>>>>> On Jun 25, 2025, at 1:18 PM, Josh McKenzie <jmcken...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>> Did some more digging. Apparently the way a lot of headline-grabbers have been making models reproduce code verbatim is to prompt them with dozens of verbatim tokens of copyrighted code as input, where the completion is then very heavily weighted to regurgitate the initial implementation. Which makes sense; if you copy/paste 100 lines of copyrighted code, the statistically likely completion will be that initial implementation.
>>>>>>>>>
>>>>>>>>> For local LLMs, the likelihood of verbatim reproduction is different but apparently comparably unlikely, because they have far fewer parameters (32B vs. 671B for DeepSeek, for instance) relative to a pre-training corpus of trillions of tokens (30T in the case of Qwen3-32B, for instance), so the individual tokens from the copyrighted material are highly unlikely to actually be stored in the model to be reproduced, and certainly not in sequence. They don't have the post-generation checks claimed by the SOTA models, but are apparently considered in the "< 1 in 10,000 completions will generate copyrighted code" territory.
>>>>>>>>>
>>>>>>>>> When given a human-language prompt, or a multi-agent pipelined "still human language but from your architect agent" prompt, the likelihood of producing a string of copyrighted code in that manner is statistically very, very low. I think we're at far more risk of contributors copy/pasting Stack Overflow or code from other projects than we are of modern genAI models producing blocks of copyrighted code.
>>>>>>>>>
>>>>>>>>> All of which comes back to: if people disclose if they used AI, what models, and whether they used the code or text the model wrote verbatim or used it as scaffolding and then heavily modified everything, I think we'll be in a pretty good spot.
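>>>>>>>>>
>>>>>>>>> For instance, a couple of lines in the PR description could cover all three points (wording purely illustrative):
>>>>>>>>>
>>>>>>>>>     AI disclosure: parts of this patch were generated with <model>;
>>>>>>>>>     the output was used as scaffolding and heavily modified by hand.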
>>>>>>>>>
>>>>>>>>> On Wed, Jun 25, 2025, at 12:47 PM, David Capwell wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> 2. Models that do not do output filtering to restrict the
reproduction of training data unless the tool can ensure the output is
license compatible?
>>>>>>>>>>>
>>>>>>>>>>> 2 would basically prohibit locally run models.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I am not for this, for the reasons listed above. There isn't a difference between this and a contributor copying code and sending it our way. We still need to validate that the code can be accepted.
>>>>>>>>>>
>>>>>>>>>> We also have the issue of this being a broad stroke. If the user asked a model to write a test for the code the human wrote, do we reject the contribution because they used a local model? This poses very little copyright risk, yet our policy would now reject it.
>>>>>>>>>>
>>>>>>>>>> Sent from my iPhone
>>>>>>>>>>
>>>>>>>>>>> On Jun 25, 2025, at 9:10 AM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>>>>>>>>>>>
>>>>>>>>>>> 2. Models that do not do output filtering to restrict the
reproduction of training data unless the tool can ensure the output is
license compatible?
>>>>>>>>>>>
>>>>>>>>>>> 2 would basically prohibit locally run models.
>>>>>>>>>
>>>>>>>>>
>>>
>