> These change very often and it's a constant moving target.
This is not hyperbole. This area is moving faster than anything I've seen 
before.

> I am in this camp at the moment, AI vs Human has the same problem for the 
> reviewer; we are supposed to be doing this, and blocking AI or putting new 
> rules around AI doesn't really change anything, we are still supposed to do 
> this work.  
+1. 

> I am personally in the stance that disclosure (which is the ASF policy) is 
> best for the time being; nothing in this thread has motivated me to change 
> the current policy.
Yep. Option 2 (guidance and disclosure) makes the most sense to me after 
reading this thread.

On Tue, Jun 24, 2025, at 5:09 PM, David Capwell wrote:
> > It's not clear from that thread precisely what they are objecting to and 
> > whether it has changed (another challenge!)
> 
> That thread was last updated in 2023 and the current stance is just "tell 
> people which one you used, and make sure the output follows the 3 main 
> points".  
> 
> > We can make a best effort to just vet the ones that people actually want to 
> > widely use and refuse everything else and be better off than allowing 
> > people to use tools that are known not to be license compatible or make 
> > little/no effort to avoid reproducing large amounts of copyrighted code.
> 
> How often are we going to "vet" new tools?  These change very often and it's 
> a constant moving target.  Are we going to expect someone to do this vetting, 
> give the pros/cons of what has changed since the last vote, then revote every 
> 6 months?  What does "vet" even mean?  
> 
> > allowing people to use tools that are known not to be license compatible
> 
> Which tools are you referring to?  The major providers all document that the 
> output is owned by the entity that requested it.  
> 
> > make little/no effort to avoid reproducing large amounts of copyrighted 
> > code.
> 
> How do you go about qualifying that?  Which tools / services are you 
> referring to?  How do you go about evaluating them?
> 
> > If someone submits copyrighted code to the project, whether an AI generated 
> > it or they just grabbed it from a Google search, it’s on the project to try 
> > not to accept it.
> 
> I am in this camp at the moment, AI vs Human has the same problem for the 
> reviewer; we are supposed to be doing this, and blocking AI or putting new 
> rules around AI doesn't really change anything, we are still supposed to do 
> this work.  
> 
> > What would you want?
> 
> My vote would be on 2/3 given the list from Ariel.  But I am personally in 
> the stance that disclosure (which is the ASF policy) is best for the time 
> being; nothing in this thread has motivated me to change the current policy.
> 
> On Mon, Jun 16, 2025 at 4:21 PM Patrick McFadin <pmcfa...@gmail.com> wrote:
>> I'm on board with the allow list (1) or option 2.  3 just isn't realistic anymore. 
>> 
>> Patrick
>> 
>> 
>> 
>> On Mon, Jun 16, 2025 at 3:09 PM Caleb Rackliffe <calebrackli...@gmail.com> 
>> wrote:
>>> I haven't participated much here, but my vote would be basically #1, i.e. 
>>> an "allow list" with a clear procedure for expansion.
>>> 
>>> On Mon, Jun 16, 2025 at 4:05 PM Ariel Weisberg <ar...@weisberg.ws> wrote:
>>>> Hi,
>>>> 
>>>> We could, but if the allow list is binding then it's still an allow list 
>>>> with some guidance on how to expand the allow list.
>>>> 
>>>> If it isn't binding then it's guidance so still option 2 really.
>>>> 
>>>> I think the key distinction to find some early consensus on is whether we 
>>>> do a binding allow list or guidance. Then we can iron out the guidance 
>>>> itself, which I think will be less controversial to work out.
>>>> 
>>>> Or option 3 which is not accepting AI generated contributions. I think 
>>>> there are some with healthy skepticism of AI generated code, but so far I 
>>>> haven't met anyone who wants to forbid it entirely.
>>>> 
>>>> Ariel
>>>> 
>>>> On Mon, Jun 16, 2025, at 4:54 PM, Josh McKenzie wrote:
>>>>> Couldn't our official stance be a combination of 1 and 2? i.e. "Here's an 
>>>>> allow list. If you're using something not on that allow list, here's some 
>>>>> basic guidance and maybe let us know how you tried to mitigate some of 
>>>>> this risk so we can update our allow list w/some nuance".
>>>>> 
>>>>> On Mon, Jun 16, 2025, at 4:39 PM, Ariel Weisberg wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> On Wed, Jun 11, 2025, at 3:48 PM, Jeremiah Jordan wrote:
>>>>>>> Where are you getting this from?  From the OpenAI terms of use: 
>>>>>>> https://openai.com/policies/terms-of-use/
>>>>>> Direct from the ASF legal mailing list discussion I linked to in my 
>>>>>> original email calling this out 
>>>>>> https://lists.apache.org/thread/lcvxnpf39v22lc3f9t5fo07p19237d16 It's 
>>>>>> not clear from that thread precisely what they are objecting to and 
>>>>>> whether it has changed (another challenge!), but I believe it's 
>>>>>> restrictions on what you are allowed to do with the output of OpenAI 
>>>>>> models. And if you get the output via other services it's under a 
>>>>>> different license and it's fine!
>>>>>> 
>>>>>> Already we are demonstrating that it is not trivial to understand what is 
>>>>>> and isn't allowed.
>>>>>> 
>>>>>> On Wed, Jun 11, 2025, at 3:48 PM, Jeremiah Jordan wrote:
>>>>>>> I still maintain that trying to publish an exhaustive list of 
>>>>>>> acceptable tools does not seem reasonable.
>>>>>>> But I agree that giving people guidance is possible.  Maybe having a 
>>>>>>> statement in the contribution guidelines along the lines of:
>>>>>> The list doesn't need to be exhaustive. We are not required to accept AI 
>>>>>> generated code at all!
>>>>>> 
>>>>>> We can make a best effort to just vet the ones that people actually want 
>>>>>> to widely use and refuse everything else and be better off than allowing 
>>>>>> people to use tools that are known not to be license compatible or make 
>>>>>> little/no effort to avoid reproducing large amounts of copyrighted code.
>>>>>> 
>>>>>>> “Make sure your tools do X, here are some that at the time of being 
>>>>>>> added to this list did X, Tool A, Tool B …
>>>>>>> Here is a list of tools that at the time of being added to this list 
>>>>>>> did not satisfy X. Tool Z - reason why”
>>>>>> I would be fine with this as an outcome. If we voted with multiple 
>>>>>> options it wouldn't be my first choice.
>>>>>> 
>>>>>> This thread only has 4 participants so far so it's hard to get a signal 
>>>>>> on what people would want if we tried to vote.
>>>>>> 
>>>>>> David, Scott, anyone else if the options were:
>>>>>>  1. Allow list
>>>>>>  2. Basic guidance as suggested by Jeremiah, but primarily leave it up 
>>>>>> to contributor/reviewer
>>>>>>  3. Do nothing
>>>>>>  4. My choice isn't here
>>>>>> What would you want?
>>>>>> 
>>>>>> My vote in choice order is 1,2,3.
>>>>>> 
>>>>>> Ariel
>>>>>> 
>>>>>> 
>>>>>> On Wed, Jun 11, 2025, at 3:48 PM, Jeremiah Jordan wrote:
>>>>>>> 
>>>>>>>>  I respectfully mean that contributors, reviewers, and committers 
>>>>>>>> can't feasibly understand and enforce the ASF guidelines.
>>>>>>> If this is true, then the ASF is in a lot of trouble and you should 
>>>>>>> bring it up with the ASF board.
>>>>>>> Where are you getting this from?  From the OpenAI terms of use: 
>>>>>>> https://openai.com/policies/terms-of-use/
>>>>>>> 
>>>>>>>> We don't even necessarily need to be that restrictive beyond requiring 
>>>>>>>> tools that make at least some effort not to reproduce large amounts of 
>>>>>>>> copyrighted code that may/may not be license compatible or tools that 
>>>>>>>> are themselves not license compatible. This ends up encompassing most 
>>>>>>>> of the ones people want to use anyways.
>>>>>>> 
>>>>>>>> There is a non-zero amount we can do to educate and guide that would 
>>>>>>>> be better than pointing people to the ASF guidelines and leaving it at 
>>>>>>>> that.
>>>>>>> 
>>>>>>> I still maintain that trying to publish an exhaustive list of 
>>>>>>> acceptable tools does not seem reasonable.
>>>>>>> But I agree that giving people guidance is possible.  Maybe having a 
>>>>>>> statement in the contribution guidelines along the lines of:
>>>>>>> “Make sure your tools do X, here are some that at the time of being 
>>>>>>> added to this list did X, Tool A, Tool B …
>>>>>>> Here is a list of tools that at the time of being added to this list 
>>>>>>> did not satisfy X. Tool Z - reason why”
>>>>>>> 
>>>>>>> -Jeremiah
>>>>>>> 
>>>>>>> 
>>>>>>> On Jun 11, 2025 at 11:48:30 AM, Ariel Weisberg <ar...@weisberg.ws> 
>>>>>>> wrote:
>>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> I am not saying you said it, but I respectfully mean that 
>>>>>>>> contributors, reviewers, and committers can't feasibly understand and 
>>>>>>>> enforce the ASF guidelines. We would be another link in a chain of 
>>>>>>>> people abdicating responsibility starting with LLM vendors serving up 
>>>>>>>> models that reproduce copyrighted code, then going to ASF legal which 
>>>>>>>> gives us guidelines without the tools to enforce those guidelines, and 
>>>>>>>> now we (the PMC) would be doing the same to contributors, reviewers, 
>>>>>>>> and committers.
>>>>>>>> 
>>>>>>>>> I don’t think anyone is going to be able to maintain and enforce a 
>>>>>>>>> list of acceptable tools for contributors to the project to stick to. 
>>>>>>>>> We can’t know what someone did on their laptop, all we can do is 
>>>>>>>>> evaluate the code they submit.
>>>>>>>> I agree we might not be able to do a perfect job at any aspect of 
>>>>>>>> trying to make sure that the code we accept is not problematic in some 
>>>>>>>> way, but that doesn't mean we shouldn't try?
>>>>>>>> 
>>>>>>>> We don't even necessarily need to be that restrictive beyond requiring 
>>>>>>>> tools that make at least some effort not to reproduce large amounts of 
>>>>>>>> copyrighted code that may/may not be license compatible or tools that 
>>>>>>>> are themselves not license compatible. This ends up encompassing most 
>>>>>>>> of the ones people want to use anyways.
>>>>>>>> 
>>>>>>>> How many people are aware that if you get code from OpenAI directly 
>>>>>>>> the license isn't ASL compatible, but if you get it via 
>>>>>>>> Microsoft services that use OpenAI models it is ASL compatible? 
>>>>>>>> It's not in the ASF guidelines (it was but they removed it!).
>>>>>>>> 
>>>>>>>> How many people are aware that when people use locally run models 
>>>>>>>> there is no output filtering, further increasing the odds of the model 
>>>>>>>> reproducing copyright encumbered code?
>>>>>>>> 
>>>>>>>> There is a non-zero amount we can do to educate and guide that would 
>>>>>>>> be better than pointing people to the ASF guidelines and leaving it at 
>>>>>>>> that.
>>>>>>>> 
>>>>>>>> The ASF guidelines themselves have suggestions like requiring people 
>>>>>>>> to say if they used AI and then which AI. I don't think it's very 
>>>>>>>> useful beyond checking license compatibility of the AI itself, but 
>>>>>>>> that is something we should be doing so it might as well be documented 
>>>>>>>> and included in the PR text.
>>>>>>>> 
>>>>>>>> Ariel
>>>>>>>> 
>>>>>>>> On Mon, Jun 2, 2025, at 7:54 PM, Jeremiah Jordan wrote:
>>>>>>>>> I don’t think I said we should abdicate responsibility?  I said the 
>>>>>>>>> key point is that contributors, and more importantly reviewers and 
>>>>>>>>> committers understand the ASF guidelines and hold all code to those 
>>>>>>>>> standards. Any suspect code should be blocked during review. As Roman 
>>>>>>>>> says in your quote, this isn’t about AI, it’s about copyright. If 
>>>>>>>>> someone submits copyrighted code to the project, whether an AI 
>>>>>>>>> generated it or they just grabbed it from a Google search, it’s on 
>>>>>>>>> the project to try not to accept it.
>>>>>>>>> I don’t think anyone is going to be able to maintain and enforce a 
>>>>>>>>> list of acceptable tools for contributors to the project to stick to. 
>>>>>>>>> We can’t know what someone did on their laptop, all we can do is 
>>>>>>>>> evaluate the code they submit.
>>>>>>>>> 
>>>>>>>>> -Jeremiah
>>>>>>>>> 
>>>>>>>>> On Mon, Jun 2, 2025 at 6:39 PM Ariel Weisberg <ar...@weisberg.ws> 
>>>>>>>>> wrote:
>>>>>>>>>> Hi,
>>>>>>>>>> 
>>>>>>>>>> As PMC members/committers we aren't supposed to abdicate this to 
>>>>>>>>>> legal or to contributors. Despite the fact that we aren't equipped 
>>>>>>>>>> to solve this problem, we are supposed to be making sure that code 
>>>>>>>>>> contributed is non-infringing.
>>>>>>>>>> 
>>>>>>>>>> This is a quotation from Roman Shaposhnik from this legal thread 
>>>>>>>>>> https://lists.apache.org/thread/f6k93xx67pc33o0yhm24j3dpq0323gyd
>>>>>>>>>> 
>>>>>>>>>>> Yes, because you have to. Again -- forget about AI -- if a drive-by 
>>>>>>>>>>> contributor submits a patch that has huge amounts of code stolen 
>>>>>>>>>>> from some existing copyright holder -- it is very much ON YOU as a 
>>>>>>>>>>> committer/PMC to prevent that from happening.
>>>>>>>>>> 
>>>>>>>>>> We aren't supposed to knowingly allow people to use AI tools that 
>>>>>>>>>> are known to generate infringing contributions or contributions 
>>>>>>>>>> which are not license compatible (such as OpenAI terms of use).
>>>>>>>>>> 
>>>>>>>>>> Ariel
>>>>>>>>>> On Mon, Jun 2, 2025, at 7:08 PM, Jeremiah Jordan wrote:
>>>>>>>>>>> > Ultimately it's the contributor's (and committer's) job to ensure 
>>>>>>>>>>> > that their contributions meet the bar for acceptance
>>>>>>>>>>> To me this is the key point. Given how pervasive this stuff is 
>>>>>>>>>>> becoming, I don’t think it’s feasible to make some list of tools 
>>>>>>>>>>> and enforce it.  Even without getting into extra tools, IDEs 
>>>>>>>>>>> (including IntelliJ) are doing more and more LLM based code 
>>>>>>>>>>> suggestion as time goes on.
>>>>>>>>>>> I think we should point people to the ASF Guidelines around such 
>>>>>>>>>>> tools, and the guidelines around copyrighted code, and then 
>>>>>>>>>>> continue to review patches with the high standards we have always 
>>>>>>>>>>> had in this project.
>>>>>>>>>>> 
>>>>>>>>>>> -Jeremiah
>>>>>>>>>>> 
>>>>>>>>>>> On Mon, Jun 2, 2025 at 5:52 PM Ariel Weisberg <ar...@weisberg.ws> 
>>>>>>>>>>> wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>> 
>>>>>>>>>>>> To clarify, are you saying that we should not accept AI generated 
>>>>>>>>>>>> code until it has been looked at by a human and then written again 
>>>>>>>>>>>> with different "wording" to ensure that it doesn't directly copy 
>>>>>>>>>>>> anything?
>>>>>>>>>>>> 
>>>>>>>>>>>> Or do you mean something else about the quality of "vibe coding" 
>>>>>>>>>>>> and how we shouldn't allow it because it makes bad code? 
>>>>>>>>>>>> Ultimately it's the contributor's (and committer's) job to ensure 
>>>>>>>>>>>> that their contributions meet the bar for acceptance and I don't 
>>>>>>>>>>>> think we should tell them how to go about meeting that bar beyond 
>>>>>>>>>>>> what is needed to address the copyright concern.
>>>>>>>>>>>> 
>>>>>>>>>>>> I agree that the bar set by the Apache guidelines is pretty high. 
>>>>>>>>>>>> They are simultaneously impossible and trivial to meet depending 
>>>>>>>>>>>> on how you interpret them, and we are not very well equipped to 
>>>>>>>>>>>> interpret them.
>>>>>>>>>>>> 
>>>>>>>>>>>> It would have been more straightforward for them to simply say no, 
>>>>>>>>>>>> but they opted not to, which implies there is some way for PMCs to 
>>>>>>>>>>>> acceptably take AI generated contributions.
>>>>>>>>>>>> 
>>>>>>>>>>>> Ariel
>>>>>>>>>>>> 
>>>>>>>>>>>> On Mon, Jun 2, 2025, at 5:03 PM, David Capwell wrote:
>>>>>>>>>>>>>> fine tuning encourage not reproducing things verbatim [...] I think not 
>>>>>>>>>>>>>> producing copyrighted output from your training data is a 
>>>>>>>>>>>>>> technically feasible achievement for these vendors so I have a 
>>>>>>>>>>>>>> moderate level of trust they will succeed at it if they say they 
>>>>>>>>>>>>>> do it.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Some team members and I discussed this in the context of my 
>>>>>>>>>>>>> documentation patch (which utilized Claude during composition). I 
>>>>>>>>>>>>> conducted an experiment to pose high-level Cassandra-related 
>>>>>>>>>>>>> questions to a model without additional context, while adjusting 
>>>>>>>>>>>>> the temperature parameter (tested at 0.2, 0.5, and 0.8). The 
>>>>>>>>>>>>> results revealed that each test generated content copied verbatim 
>>>>>>>>>>>>> from a specific non-Apache (and non-DSE) website. I did not 
>>>>>>>>>>>>> verify whether this content was copyrighted, though it was easily 
>>>>>>>>>>>>> identifiable through a simple Google search. This occurred as a 
>>>>>>>>>>>>> single sentence within the generated document, and as I am not a 
>>>>>>>>>>>>> legal expert, I cannot determine whether this constitutes a 
>>>>>>>>>>>>> significant issue.
>>>>>>>>>>>>> 
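>>>>>>>>>>>>> Roughly the shape of the experiment (an illustrative sketch rather 
>>>>>>>>>>>>> than the exact script; it assumes the Anthropic Python SDK, an API 
>>>>>>>>>>>>> key in the environment, and a placeholder model name):
>>>>>>>>>>>>> 
>>>>>>>>>>>>> ```
>>>>>>>>>>>>> # Ask the same context-free Cassandra question at several temperatures
>>>>>>>>>>>>> # and print the answers so they can be spot-checked via a web search.
>>>>>>>>>>>>> import anthropic
>>>>>>>>>>>>> 
>>>>>>>>>>>>> client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
>>>>>>>>>>>>> QUESTION = "Explain how Cassandra handles tombstones and compaction."
>>>>>>>>>>>>> 
>>>>>>>>>>>>> for temperature in (0.2, 0.5, 0.8):
>>>>>>>>>>>>>     message = client.messages.create(
>>>>>>>>>>>>>         model="claude-3-5-sonnet-latest",  # placeholder; use whatever model you have access to
>>>>>>>>>>>>>         max_tokens=1024,
>>>>>>>>>>>>>         temperature=temperature,
>>>>>>>>>>>>>         messages=[{"role": "user", "content": QUESTION}],
>>>>>>>>>>>>>     )
>>>>>>>>>>>>>     text = "".join(block.text for block in message.content if block.type == "text")
>>>>>>>>>>>>>     print(f"--- temperature={temperature} ---")
>>>>>>>>>>>>>     print(text)
>>>>>>>>>>>>> ```
>>>>>>>>>>>>> 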
>>>>>>>>>>>>> The complexity increases when considering models trained on 
>>>>>>>>>>>>> different languages, which may translate content into English. In 
>>>>>>>>>>>>> such cases, a Google search would fail to detect the origin. Is 
>>>>>>>>>>>>> this still considered plagiarism? Does it violate copyright laws? 
>>>>>>>>>>>>> I am uncertain.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Similar challenges arise with code generation. For instance, if a 
>>>>>>>>>>>>> model is trained on a GPL-licensed Python library that implements 
>>>>>>>>>>>>> a novel data structure, and the model subsequently rewrites this 
>>>>>>>>>>>>> structure in Java, a Google search is unlikely to identify the 
>>>>>>>>>>>>> source.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Personally, I do not assume these models will avoid producing 
>>>>>>>>>>>>> copyrighted material. This doesn’t mean I am against AI at all, 
>>>>>>>>>>>>> but rather reflects my belief that the requirements set by Apache 
>>>>>>>>>>>>> are not easily “provable” in such scenarios.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> My personal opinion is that we should at least consider allow 
>>>>>>>>>>>>>> listing a few specific sources (any vendor that scans output for 
>>>>>>>>>>>>>> infringement) and add that to the PR template and in other 
>>>>>>>>>>>>>> locations (readme, web site). Bonus points if we can set up code 
>>>>>>>>>>>>>> scanning (useful for non-AI contributions!).
>>>>>>>>>>>>> 
>>>>>>>>>>>>> My perspective, after trying to see what AI can do, is the 
>>>>>>>>>>>>> following:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Strengths
>>>>>>>>>>>>> * Generating a preliminary draft of a document and assisting with 
>>>>>>>>>>>>> iterative revisions
>>>>>>>>>>>>> * Documenting individual methods
>>>>>>>>>>>>> * Generation of “simple” methods and scripts, provided the 
>>>>>>>>>>>>> underlying libraries are well-documented in public repositories
>>>>>>>>>>>>> * Managing repetitive or procedural tasks, such as “migrating 
>>>>>>>>>>>>> from X to Y” or “converting serializations to the X interface”
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Limitations
>>>>>>>>>>>>> * Producing a fully functional document in a single attempt that 
>>>>>>>>>>>>> meets merge standards. When documenting Gens.java and 
>>>>>>>>>>>>> Property.java, the output appeared plausible but contained 
>>>>>>>>>>>>> frequent inaccuracies.
>>>>>>>>>>>>> * Addressing complex or ambiguous scenarios (“gossip”), though 
>>>>>>>>>>>>> this challenge is not unique to AI—Matt Byrd and I tested Claude 
>>>>>>>>>>>>> for CASSANDRA-20659, where it could identify relevant code but 
>>>>>>>>>>>>> proposed solutions that risked corrupting production clusters.
>>>>>>>>>>>>> * Interpreting large-scale codebases. Beyond approximately 300 
>>>>>>>>>>>>> lines of actual code (excluding formatting), performance degrades 
>>>>>>>>>>>>> significantly, leading to a marked decline in output quality.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Note: When referring to AI/LLMs, I am not discussing interactions 
>>>>>>>>>>>>> with a user interface to execute specific tasks, but rather 
>>>>>>>>>>>>> leveraging code agents like Roo and Aider to provide contextual 
>>>>>>>>>>>>> information to the LLM.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Given these observations, it remains challenging to determine 
>>>>>>>>>>>>> optimal practices. In some contexts it's very clear that 
>>>>>>>>>>>>> nothing was taken from external work (e.g., “create a test using 
>>>>>>>>>>>>> our BTree class that inserts a row with a null column,” “analyze 
>>>>>>>>>>>>> this function’s purpose”). However, for substantial tasks, the 
>>>>>>>>>>>>> situation becomes more complex. If the author employed AI as a 
>>>>>>>>>>>>> collaborative tool during “pair programming,” the concerns are not 
>>>>>>>>>>>>> really that different from those around Google searches (unless the work 
>>>>>>>>>>>>> involves unique elements like introducing new data structures or 
>>>>>>>>>>>>> indexes). Conversely, if the author “vibe coded” the entire 
>>>>>>>>>>>>> patch, two primary concerns arise: does the author have rights to 
>>>>>>>>>>>>> the code, and does its quality align with requirements?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> TL;DR - I am not against AI contributions, but strongly prefer 
>>>>>>>>>>>>> it's done as “pair programming”.  My experience with “vibe coding” 
>>>>>>>>>>>>> makes me worry about the quality of the code, and that the author 
>>>>>>>>>>>>> is less likely to validate that the code generated is safe to 
>>>>>>>>>>>>> donate.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> This email was generated with the help of AI =)
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On May 30, 2025, at 3:00 PM, Ariel Weisberg <ar...@weisberg.ws> 
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> It looks like we haven't discussed this much and haven't settled 
>>>>>>>>>>>>>> on a policy for what kinds of AI generated contributions we 
>>>>>>>>>>>>>> accept and what vetting is required for them.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> https://www.apache.org/legal/generative-tooling.html#:~:text=Given%20the%20above,code%20scanning%20results.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>> Given the above, code generated in whole or in part using AI can 
>>>>>>>>>>>>>> be contributed if the contributor ensures that:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 1. The terms and conditions of the generative AI tool do not 
>>>>>>>>>>>>>> place any restrictions on use of the output that would be 
>>>>>>>>>>>>>> inconsistent with the Open Source Definition.
>>>>>>>>>>>>>> 2. At least one of the following conditions is met:
>>>>>>>>>>>>>>    2.1 The output is not copyrightable subject matter (and would 
>>>>>>>>>>>>>> not be even if produced by a human).
>>>>>>>>>>>>>>    2.2 No third party materials are included in the output.
>>>>>>>>>>>>>>    2.3 Any third party materials that are included in the output 
>>>>>>>>>>>>>> are being used with permission (e.g., under a compatible 
>>>>>>>>>>>>>> open-source license) of the third party copyright holders and in 
>>>>>>>>>>>>>> compliance with the applicable license terms.
>>>>>>>>>>>>>> 3. A contributor obtains reasonable certainty that conditions 
>>>>>>>>>>>>>> 2.2 or 2.3 are met if the AI tool itself provides sufficient 
>>>>>>>>>>>>>> information about output that may be similar to training data, 
>>>>>>>>>>>>>> or from code scanning results.
>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> There is a lot to unpack there, but it seems like any one of the 
>>>>>>>>>>>>>> conditions under 2 needs to be met, and 3 describes how 2.2 and 
>>>>>>>>>>>>>> 2.3 can be satisfied.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 2.1 is tricky as we are not copyright lawyers, and 2.2 and 2.3 
>>>>>>>>>>>>>> are a pretty high bar in that it's hard to know if you have met 
>>>>>>>>>>>>>> them. Do we have anyone in the community running any code scanning 
>>>>>>>>>>>>>> tools already?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Here is the JIRA for addition of the generative AI policy: 
>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/LEGAL-631
>>>>>>>>>>>>>> Legal mailing list discussion of the policy: 
>>>>>>>>>>>>>> https://lists.apache.org/thread/vw3jf4726yrhovg39mcz1y89mx8j4t8s
>>>>>>>>>>>>>> Legal mailing list discussion of compliant tools: 
>>>>>>>>>>>>>> https://lists.apache.org/thread/nzyl311q53xhpq99grf6l1h076lgzybr
>>>>>>>>>>>>>> Legal mailing list discussion about how Open AI terms are not 
>>>>>>>>>>>>>> Apache compatible: 
>>>>>>>>>>>>>> https://lists.apache.org/thread/lcvxnpf39v22lc3f9t5fo07p19237d16
>>>>>>>>>>>>>> Hadoop mailing list message hinting that they accept 
>>>>>>>>>>>>>> contributions but ask which tool: 
>>>>>>>>>>>>>> https://lists.apache.org/thread/bgs8x1f9ovrjmhg6b450bz8bt7o43yxj
>>>>>>>>>>>>>> Spark mailing list message where they have given up on stopping 
>>>>>>>>>>>>>> people: 
>>>>>>>>>>>>>> https://lists.apache.org/thread/h6621sxfxcnnpsoyr31x65z207kk80fr
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I didn't see other projects discussing and deciding how to 
>>>>>>>>>>>>>> handle these contributions, but I also didn't check that many of 
>>>>>>>>>>>>>> them, only Hadoop, Spark, Druid, and Pulsar. I also can't see their 
>>>>>>>>>>>>>> PMC mailing list.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I asked O3 to deep research what is done to avoid producing 
>>>>>>>>>>>>>> copyrighted code: 
>>>>>>>>>>>>>> https://chatgpt.com/share/683a2983-dd9c-8009-9a66-425012af840d
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> To summarize: training deduplicates the training data so the model 
>>>>>>>>>>>>>> is less likely to reproduce it verbatim; prompts and fine tuning 
>>>>>>>>>>>>>> encourage not reproducing things verbatim; the inference is biased 
>>>>>>>>>>>>>> to not pick the best option but some neighboring one, encouraging 
>>>>>>>>>>>>>> originality; and in some instances the output is checked to make 
>>>>>>>>>>>>>> sure it doesn't match the training data. So to some extent 2.2 is 
>>>>>>>>>>>>>> being done, to different degrees depending on 
>>>>>>>>>>>>>> what product you are using.
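>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> As a toy illustration of the sampling point (not how any particular 
>>>>>>>>>>>>>> vendor implements it): instead of always taking the single most 
>>>>>>>>>>>>>> likely token, the distribution is flattened with a temperature and 
>>>>>>>>>>>>>> sampled, so exact training text is less likely to be replayed token 
>>>>>>>>>>>>>> for token.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>> # Toy sketch of temperature sampling over next-token logits.
>>>>>>>>>>>>>> import numpy as np
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> def sample_next_token(logits: np.ndarray, temperature: float = 0.8) -> int:
>>>>>>>>>>>>>>     """Pick a token index; higher temperature spreads probability mass
>>>>>>>>>>>>>>     beyond the single best token (argmax as temperature approaches 0)."""
>>>>>>>>>>>>>>     scaled = logits / max(temperature, 1e-6)
>>>>>>>>>>>>>>     probs = np.exp(scaled - scaled.max())
>>>>>>>>>>>>>>     probs /= probs.sum()
>>>>>>>>>>>>>>     return int(np.random.choice(len(probs), p=probs))
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> logits = np.array([3.2, 3.1, 1.0, -2.0])  # made-up scores for four candidate tokens
>>>>>>>>>>>>>> print([sample_next_token(logits) for _ in range(10)])
>>>>>>>>>>>>>> ```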
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> It's worth noting that scanning the output can be probabilistic 
>>>>>>>>>>>>>> in the case of, say, Anthropic, and they still recommend code 
>>>>>>>>>>>>>> scanning.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Quite notably, Anthropic indemnifies its enterprise users 
>>>>>>>>>>>>>> against copyright claims. It's not perfect, but it does 
>>>>>>>>>>>>>> mean they have an incentive to make sure there are fewer 
>>>>>>>>>>>>>> copyright claims. We could choose to be picky and only accept 
>>>>>>>>>>>>>> specific sources of LLM generated code based on perceived safety.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I think not producing copyrighted output from your training data 
>>>>>>>>>>>>>> is a technically feasible achievement for these vendors so I 
>>>>>>>>>>>>>> have a moderate level of trust they will succeed at it if they 
>>>>>>>>>>>>>> say they do it.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I could send a message to the legal list asking for 
>>>>>>>>>>>>>> clarification and a set of tools, but based on Roman's 
>>>>>>>>>>>>>> communication 
>>>>>>>>>>>>>> (https://lists.apache.org/thread/f6k93xx67pc33o0yhm24j3dpq0323gyd)
>>>>>>>>>>>>>>  I think this is kind of what we get. It's on us to ensure the 
>>>>>>>>>>>>>> contributions are kosher either by code scanning or accepting 
>>>>>>>>>>>>>> that the LLM vendors are doing a good job at avoiding 
>>>>>>>>>>>>>> copyrighted output.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> My personal opinion is that we should at least consider allow 
>>>>>>>>>>>>>> listing a few specific sources (any vendor that scans output for 
>>>>>>>>>>>>>> infringement) and add that to the PR template and in other 
>>>>>>>>>>>>>> locations (readme, web site). Bonus points if we can set up code 
>>>>>>>>>>>>>> scanning (useful for non-AI contributions!).
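>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> For the scanning piece, even something lightweight would be a start. 
>>>>>>>>>>>>>> A minimal sketch (the window size and layout are made up; real 
>>>>>>>>>>>>>> scanning services do far more) that flags runs of added lines in a 
>>>>>>>>>>>>>> patch that also appear verbatim in a local corpus of known 
>>>>>>>>>>>>>> third-party code:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>> # Flag runs of added lines in a unified diff that also appear, as a
>>>>>>>>>>>>>> # consecutive block, in a local corpus of known third-party code.
>>>>>>>>>>>>>> import sys
>>>>>>>>>>>>>> from pathlib import Path
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> WINDOW = 8  # consecutive matching lines before we complain (arbitrary)
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> def normalize(line):
>>>>>>>>>>>>>>     return " ".join(line.split())
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> def added_lines(diff_text):
>>>>>>>>>>>>>>     return [normalize(l[1:]) for l in diff_text.splitlines()
>>>>>>>>>>>>>>             if l.startswith("+") and not l.startswith("+++") and l[1:].strip()]
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> def corpus_windows(corpus_dir):
>>>>>>>>>>>>>>     windows = set()
>>>>>>>>>>>>>>     for path in Path(corpus_dir).rglob("*"):
>>>>>>>>>>>>>>         if not path.is_file():
>>>>>>>>>>>>>>             continue
>>>>>>>>>>>>>>         lines = [normalize(l) for l in path.read_text(errors="ignore").splitlines()
>>>>>>>>>>>>>>                  if l.strip()]
>>>>>>>>>>>>>>         for i in range(len(lines) - WINDOW + 1):
>>>>>>>>>>>>>>             windows.add(tuple(lines[i:i + WINDOW]))
>>>>>>>>>>>>>>     return windows
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> if __name__ == "__main__":
>>>>>>>>>>>>>>     diff_path, corpus_dir = sys.argv[1], sys.argv[2]
>>>>>>>>>>>>>>     added = added_lines(Path(diff_path).read_text())
>>>>>>>>>>>>>>     known = corpus_windows(corpus_dir)
>>>>>>>>>>>>>>     for i in range(len(added) - WINDOW + 1):
>>>>>>>>>>>>>>         window = tuple(added[i:i + WINDOW])
>>>>>>>>>>>>>>         if window in known:
>>>>>>>>>>>>>>             print(f"Possible verbatim copy starting at added line {i + 1}:")
>>>>>>>>>>>>>>             print("\n".join(window))
>>>>>>>>>>>>>> ```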
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> Ariel
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
