Hi, OK, so exclude instead of allow is an option 5, and I assume this would be combined with requiring people to identify when they used generative AI and what they used? Seems like there is broad support for that, and it is the ASF recommendation.
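If we go that way, a minimal sketch of what a machine-readable exclusion list entry could look like (hypothetical structure and field names, not an agreed format), where each entry cites the exact clause at issue and the criteria for removal so that changes to a tool's terms can trigger removal from the list:

```python
# Hypothetical sketch of an exclusion list entry; the structure and field
# names are illustrative assumptions, not an agreed or official format.
EXCLUDED_TOOLS = [
    {
        "tool": "OpenAI models accessed directly through OpenAI",
        "added": "2025-06-24",
        "clause": "Use Output to develop models that compete with OpenAI",
        "reference": "https://openai.com/policies/terms-of-use/",
        "why": ("Restriction on use of the output that is inconsistent "
                "with the Open Source Definition"),
        "removal_criteria": ("The clause is dropped from the terms, or the "
                             "output is obtained via a service whose terms "
                             "are OSS compatible"),
    },
]
```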
Would the starting point for an exclusion list be something like: 1. OpenAI models accessed via a platform where the licensing is not OSS compatible (like through OpenAI directly which prohibits using output to develop models that compete with OpenAI) 2. Models that do not do output filtering to restrict the reproduction of training data unless the tool can ensure the output is license compatible? 2 would basically prohibit locally run models. Ariel On Tue, Jun 24, 2025, at 6:50 PM, David Capwell wrote: > Spoke with Ariel in slack. > > https://lists.apache.org/thread/lcvxnpf39v22lc3f9t5fo07p19237d16 > > > Not sure If I am missing more statements, but the one I can find was that the > TOC of OpenAI has the following words: "Use Output to develop models that > compete with OpenAI” (See here > https://openai.com/policies/terms-of-use/). That thread then updated > https://www.apache.org/legal/generative-tooling.html with the wording "The > terms and conditions of the generative AI tool do not place any restrictions > on use of the output that would be inconsistent with the Open Source > Definition.” And argues that that wording is what causes issues. > > > I am fine having an exclude list, where we can add tools / services that we > can reference what exactly in the TOC is in violation; that way changes to > the TOC can trigger removal from the list. The “why” and how to get removed > should be very clear and non-debatable. > > >> On Jun 24, 2025, at 2:19 PM, Josh McKenzie <jmcken...@apache.org> wrote: >> >>> These change very often and it's a constant moving target. >> This is not hyperbole. This area is moving faster than anything I've seen >> before. >> >>> I am in this camp at the moment, AI vs Human has the same problem for the >>> reviewer; we are supposed to be doing this, and blocking AI or putting new >>> rules around AI doesn't really change anything, we are still supposed to do >>> this work. >> +1. >> >>> I am personally in the stance that disclosure (which is the ASF policy) is >>> best for the time being; nothing in this thread has motivated me to change >>> the current policy. >> Yep. Option 2 - guidance and disclosure makes the most sense to me after >> reading this thread. >> >> On Tue, Jun 24, 2025, at 5:09 PM, David Capwell wrote: >>> > It's not clear from that thread precisely what they are objecting to and >>> > whether it has changed (another challenge!) >>> >>> That thread was last updated in 2023 and the current stance is just "tell >>> people which one you used, and make sure the output follows the 3 main >>> points". >>> >>> > We can make a best effort to just vet the ones that people actually want >>> > to widely use and refuse everything else and be better off than allowing >>> > people to use tools that are known not to be license compatible or make >>> > little/no effort to avoid reproducing large amounts of copyrighted code. >>> >>> How often are we going to "vet" new tools? These change very often and >>> it's a constant moving target. Are we going to expect someone to do this >>> vetting, give the pros/cons of what has changed since the last vote, then >>> revote every 6 months? What does "vet" even mean? >>> >>> > allowing people to use tools that are known not to be license compatible >>> >>> Which tools are you referring to? The major providers all document that >>> the output is owned by the entity that requested it. >>> >>> > make little/no effort to avoid reproducing large amounts of copyrighted >>> > code. >>> >>> How do you go about qualifying that? 
Which tools / services are you >>> referring to? How do you go about evaluating them? >>> >>> > If someone submits copyrighted code to the project, whether an AI >>> > generated it or they just grabbed it from a Google search, it’s on the >>> > project to try not to accept it. >>> >>> I am in this camp at the moment, AI vs Human has the same problem for the >>> reviewer; we are supposed to be doing this, and blocking AI or putting new >>> rules around AI doesn't really change anything, we are still supposed to do >>> this work. >>> >>> > What would you want? >>> >>> My vote would be on 2/3 given the list from Ariel. But I am personally in >>> the stance that disclosure (which is the ASF policy) is best for the time >>> being; nothing in this thread has motivated me to change the current policy. >>> >>> On Mon, Jun 16, 2025 at 4:21 PM Patrick McFadin <pmcfa...@gmail.com> wrote: >>>> I'm on board with the allow list (1) or option 2. 3 just isn't realistic >>>> anymore. >>>> >>>> Patrick >>>> >>>> >>>> >>>> On Mon, Jun 16, 2025 at 3:09 PM Caleb Rackliffe <calebrackli...@gmail.com> >>>> wrote: >>>>> I haven't participated much here, but my vote would be basically #1, i.e. >>>>> an "allow list" with a clear procedure for expansion. >>>>> >>>>> On Mon, Jun 16, 2025 at 4:05 PM Ariel Weisberg <ar...@weisberg.ws> wrote: >>>>>> __ >>>>>> Hi, >>>>>> >>>>>> We could, but if the allow list is binding then it's still an allow list >>>>>> with some guidance on how to expand the allow list. >>>>>> >>>>>> If it isn't binding then it's guidance so still option 2 really. >>>>>> >>>>>> I think the key distinction to find some early consensus on is whether we do a >>>>>> binding allow list or guidance, and then we can iron out the guidance, >>>>>> but I think that will be less controversial to work out. >>>>>> >>>>>> Or option 3 which is not accepting AI generated contributions. I think >>>>>> there are some with healthy skepticism of AI generated code, but so far >>>>>> I haven't met anyone who wants to forbid it entirely. >>>>>> >>>>>> Ariel >>>>>> >>>>>> On Mon, Jun 16, 2025, at 4:54 PM, Josh McKenzie wrote: >>>>>>> Couldn't our official stance be a combination of 1 and 2? i.e. "Here's >>>>>>> an allow list. If you're using something not on that allow list, here's >>>>>>> some basic guidance and maybe let us know how you tried to mitigate >>>>>>> some of this risk so we can update our allow list w/some nuance". >>>>>>> >>>>>>> On Mon, Jun 16, 2025, at 4:39 PM, Ariel Weisberg wrote: >>>>>>>> Hi, >>>>>>>> >>>>>>>> On Wed, Jun 11, 2025, at 3:48 PM, Jeremiah Jordan wrote: >>>>>>>>> Where are you getting this from? From the OpenAI terms of use: >>>>>>>>> https://openai.com/policies/terms-of-use/ >>>>>>>> Direct from the ASF legal mailing list discussion I linked to in my >>>>>>>> original email calling this out >>>>>>>> https://lists.apache.org/thread/lcvxnpf39v22lc3f9t5fo07p19237d16 It's >>>>>>>> not clear from that thread precisely what they are objecting to and >>>>>>>> whether it has changed (another challenge!), but I believe it's >>>>>>>> restrictions on what you are allowed to do with the output of OpenAI >>>>>>>> models. And if you get the output via other services it's under a >>>>>>>> different license and it's fine!
>>>>>>>> >>>>>>>> Already we are demonstrating that it is not trivial to understand what is >>>>>>>> and isn't allowed. >>>>>>>> >>>>>>>> On Wed, Jun 11, 2025, at 3:48 PM, Jeremiah Jordan wrote: >>>>>>>>> I still maintain that trying to publish an exhaustive list of >>>>>>>>> acceptable tools does not seem reasonable. >>>>>>>>> But I agree that giving people guidance is possible. Maybe having a >>>>>>>>> statement in the contribution guidelines along the lines of: >>>>>>>> The list doesn't need to be exhaustive. We are not required to accept >>>>>>>> AI generated code at all! >>>>>>>> >>>>>>>> We can make a best effort to just vet the ones that people actually >>>>>>>> want to widely use and refuse everything else and be better off than >>>>>>>> allowing people to use tools that are known not to be license >>>>>>>> compatible or make little/no effort to avoid reproducing large amounts >>>>>>>> of copyrighted code. >>>>>>>> >>>>>>>>> “Make sure your tools do X, here are some that at the time of being >>>>>>>>> added to this list did X, Tool A, Tool B … >>>>>>>>> Here is a list of tools that at the time of being added to this list >>>>>>>>> did not satisfy X. Tool Z - reason why” >>>>>>>> I would be fine with this as an outcome. If we voted with multiple >>>>>>>> options it wouldn't be my first choice. >>>>>>>> >>>>>>>> This thread only has 4 participants so far, so it's hard to get a >>>>>>>> signal on what people would want if we tried to vote. >>>>>>>> >>>>>>>> David, Scott, anyone else: if the options were: >>>>>>>> 1. Allow list >>>>>>>> 2. Basic guidance as suggested by Jeremiah, but primarily leave it up >>>>>>>> to contributor/reviewer >>>>>>>> 3. Do nothing >>>>>>>> 4. My choice isn't here >>>>>>>> What would you want? >>>>>>>> >>>>>>>> My vote in choice order is 1,2,3. >>>>>>>> >>>>>>>> Ariel >>>>>>>> >>>>>>>> >>>>>>>> On Wed, Jun 11, 2025, at 3:48 PM, Jeremiah Jordan wrote: >>>>>>>>> >>>>>>>>>> I respectfully mean that contributors, reviewers, and committers >>>>>>>>>> can't feasibly understand and enforce the ASF guidelines. >>>>>>>>> If this is true, then the ASF is in a lot of trouble and you should >>>>>>>>> bring it up with the ASF board. >>>>>>>>> Where are you getting this from? From the OpenAI terms of use: >>>>>>>>> https://openai.com/policies/terms-of-use/ >>>>>>>>> >>>>>>>>>> We don't even necessarily need to be that restrictive beyond >>>>>>>>>> requiring tools that make at least some effort not to reproduce >>>>>>>>>> large amounts of copyrighted code that may/may not be license >>>>>>>>>> compatible or tools that are themselves not license compatible. This >>>>>>>>>> ends up encompassing most of the ones people want to use anyways. >>>>>>>>> >>>>>>>>>> There is a non-zero amount we can do to educate and guide that would >>>>>>>>>> be better than pointing people to the ASF guidelines and leaving it >>>>>>>>>> at that. >>>>>>>>> >>>>>>>>> I still maintain that trying to publish an exhaustive list of >>>>>>>>> acceptable tools does not seem reasonable. >>>>>>>>> But I agree that giving people guidance is possible. Maybe having a >>>>>>>>> statement in the contribution guidelines along the lines of: >>>>>>>>> “Make sure your tools do X, here are some that at the time of being >>>>>>>>> added to this list did X, Tool A, Tool B … >>>>>>>>> Here is a list of tools that at the time of being added to this list >>>>>>>>> did not satisfy X.
Tool Z - reason why” >>>>>>>>> >>>>>>>>> -Jeremiah >>>>>>>>> >>>>>>>>> >>>>>>>>> On Jun 11, 2025 at 11:48:30 AM, Ariel Weisberg <ar...@weisberg.ws> >>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>> Hi, >>>>>>>>>> >>>>>>>>>> I am not saying you said it, but I respectfully mean that >>>>>>>>>> contributors, reviewers, and committers can't feasibly understand >>>>>>>>>> and enforce the ASF guidelines. We would be another link in a chain >>>>>>>>>> of people abdicating responsibility starting with LLM vendors >>>>>>>>>> serving up models that reproduce copyrighted code, then going to ASF >>>>>>>>>> legal which gives us guidelines without the tools to enforce those >>>>>>>>>> guidelines, and now we (the PMC) would be doing the same to >>>>>>>>>> contributors, reviewers, and committers. >>>>>>>>>> >>>>>>>>>>> I don’t think anyone is going to be able to maintain and enforce a >>>>>>>>>>> list of acceptable tools for contributors to the project to stick >>>>>>>>>>> to. We can’t know what someone did on their laptop, all we can do >>>>>>>>>>> is evaluate the code they submit. >>>>>>>>>> I agree we might not be able to do a perfect job at any aspect of >>>>>>>>>> trying to make sure that the code we accept is not problematic in >>>>>>>>>> some way, but that doesn't mean we shouldn't try? >>>>>>>>>> >>>>>>>>>> We don't even necessarily need to be that restrictive beyond >>>>>>>>>> requiring tools that make at least some effort not to reproduce >>>>>>>>>> large amounts of copyrighted code that may/may not be license >>>>>>>>>> compatible or tools that are themselves not license compatible. This >>>>>>>>>> ends up encompassing most of the ones people want to use anyways. >>>>>>>>>> >>>>>>>>>> How many people are aware that if you get code from OpenAI directly >>>>>>>>>> that the license isn't ASL compatible, but that if you get it via >>>>>>>>>> Microsoft services that use OpenAI models that it's ASL compatible? >>>>>>>>>> It's not in the ASF guidelines (it was but they removed it!). >>>>>>>>>> >>>>>>>>>> How many people are aware that when people use locally run models >>>>>>>>>> there is no output filtering, further increasing the odds of the >>>>>>>>>> model reproducing copyright encumbered code? >>>>>>>>>> >>>>>>>>>> There is a non-zero amount we can do to educate and guide that would >>>>>>>>>> be better than pointing people to the ASF guidelines and leaving it >>>>>>>>>> at that. >>>>>>>>>> >>>>>>>>>> The ASF guidelines themselves have suggestions like requiring people >>>>>>>>>> to say if they used AI and then which AI. I don't think it's very >>>>>>>>>> useful beyond checking license compatibility of the AI itself, but >>>>>>>>>> that is something we should be doing so it might as well be >>>>>>>>>> documented and included in the PR text. >>>>>>>>>> >>>>>>>>>> Ariel >>>>>>>>>> >>>>>>>>>> On Mon, Jun 2, 2025, at 7:54 PM, Jeremiah Jordan wrote: >>>>>>>>>>> I don’t think I said we should abdicate responsibility? I said the >>>>>>>>>>> key point is that contributors, and more importantly reviewers and >>>>>>>>>>> committers understand the ASF guidelines and hold all code to those >>>>>>>>>>> standards. Any suspect code should be blocked during review. As >>>>>>>>>>> Roman says in your quote, this isn’t about AI, it’s about >>>>>>>>>>> copyright. If someone submits copyrighted code to the project, >>>>>>>>>>> whether an AI generated it or they just grabbed it from a Google >>>>>>>>>>> search, it’s on the project to try not to accept it.
>>>>>>>>>>> I don’t think anyone is going to be able to maintain and enforce a >>>>>>>>>>> list of acceptable tools for contributors to the project to stick >>>>>>>>>>> to. We can’t know what someone did on their laptop, all we can do >>>>>>>>>>> is evaluate the code they submit. >>>>>>>>>>> >>>>>>>>>>> -Jeremiah >>>>>>>>>>> >>>>>>>>>>> On Mon, Jun 2, 2025 at 6:39 PM Ariel Weisberg <ar...@weisberg.ws> >>>>>>>>>>> wrote: >>>>>>>>>>>> __ >>>>>>>>>>>> Hi, >>>>>>>>>>>> >>>>>>>>>>>> As PMC members/committers we aren't supposed to abdicate this to >>>>>>>>>>>> legal or to contributors. Despite the fact that we aren't equipped >>>>>>>>>>>> to solve this problem we are supposed to be making sure that code >>>>>>>>>>>> contributed is non-infringing. >>>>>>>>>>>> >>>>>>>>>>>> This is a quotation from Roman Shaposhnik from this legal thread >>>>>>>>>>>> https://lists.apache.org/thread/f6k93xx67pc33o0yhm24j3dpq0323gyd >>>>>>>>>>>> >>>>>>>>>>>>> Yes, because you have to. Again -- forget about AI -- if a >>>>>>>>>>>>> drive-by contributor submits a patch that has huge amounts of >>>>>>>>>>>>> code stolen from some existing copyright holder -- it is very >>>>>>>>>>>>> much ON YOU as a committer/PMC to prevent that from happening. >>>>>>>>>>>> >>>>>>>>>>>> We aren't supposed to knowingly allow people to use AI tools that >>>>>>>>>>>> are known to generate infringing contributions or contributions >>>>>>>>>>>> which are not license compatible (such as OpenAI terms of use). >>>>>>>>>>>> >>>>>>>>>>>> Ariel >>>>>>>>>>>> On Mon, Jun 2, 2025, at 7:08 PM, Jeremiah Jordan wrote: >>>>>>>>>>>>> > Ultimately it's the contributor's (and committer's) job to >>>>>>>>>>>>> > ensure that their contributions meet the bar for acceptance >>>>>>>>>>>>> To me this is the key point. Given how pervasive this stuff is >>>>>>>>>>>>> becoming, I don’t think it’s feasible to make some list of tools >>>>>>>>>>>>> and enforce it. Even without getting into extra tools, IDEs >>>>>>>>>>>>> (including IntelliJ) are doing more and more LLM based code >>>>>>>>>>>>> suggestion as time goes on. >>>>>>>>>>>>> I think we should point people to the ASF Guidelines around such >>>>>>>>>>>>> tools, and the guidelines around copyrighted code, and then >>>>>>>>>>>>> continue to review patches with the high standards we have always >>>>>>>>>>>>> had in this project. >>>>>>>>>>>>> >>>>>>>>>>>>> -Jeremiah >>>>>>>>>>>>> >>>>>>>>>>>>> On Mon, Jun 2, 2025 at 5:52 PM Ariel Weisberg <ar...@weisberg.ws> >>>>>>>>>>>>> wrote: >>>>>>>>>>>>>> __ >>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>> >>>>>>>>>>>>>> To clarify are you saying that we should not accept AI generated >>>>>>>>>>>>>> code until it has been looked at by a human and then written >>>>>>>>>>>>>> again with different "wording" to ensure that it doesn't >>>>>>>>>>>>>> directly copy anything? >>>>>>>>>>>>>> >>>>>>>>>>>>>> Or do you mean something else about the quality of "vibe coding" >>>>>>>>>>>>>> and how we shouldn't allow it because it makes bad code? >>>>>>>>>>>>>> Ultimately it's the contributor's (and committer's) job to >>>>>>>>>>>>>> ensure that their contributions meet the bar for acceptance and >>>>>>>>>>>>>> I don't think we should tell them how to go about meeting that >>>>>>>>>>>>>> bar beyond what is needed to address the copyright concern. >>>>>>>>>>>>>> >>>>>>>>>>>>>> I agree that the bar set by the Apache guidelines are pretty >>>>>>>>>>>>>> high. 
They are simultaneously impossible and trivial to meet >>>>>>>>>>>>>> depending on how you interpret them and we are not very well >>>>>>>>>>>>>> equipped to interpret them. >>>>>>>>>>>>>> >>>>>>>>>>>>>> It would have been more straightforward for them to simply say >>>>>>>>>>>>>> no, but they didn't opt to do that, implying there is some way for >>>>>>>>>>>>>> PMCs to acceptably take AI generated contributions. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Ariel >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Mon, Jun 2, 2025, at 5:03 PM, David Capwell wrote: >>>>>>>>>>>>>>>> fine tuning encourage not reproducing things verbatim ... I think >>>>>>>>>>>>>>>> not producing copyrighted output from your training data is a >>>>>>>>>>>>>>>> technically feasible achievement for these vendors so I have a >>>>>>>>>>>>>>>> moderate level of trust they will succeed at it if they say >>>>>>>>>>>>>>>> they do it. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Some team members and I discussed this in the context of my >>>>>>>>>>>>>>> documentation patch (which utilized Claude during composition). >>>>>>>>>>>>>>> I conducted an experiment to pose high-level Cassandra-related >>>>>>>>>>>>>>> questions to a model without additional context, while >>>>>>>>>>>>>>> adjusting the temperature parameter (tested at 0.2, 0.5, and >>>>>>>>>>>>>>> 0.8). The results revealed that each test generated content >>>>>>>>>>>>>>> copied verbatim from a specific non-Apache (and non-DSE) >>>>>>>>>>>>>>> website. I did not verify whether this content was copyrighted, >>>>>>>>>>>>>>> though it was easily identifiable through a simple Google >>>>>>>>>>>>>>> search. This occurred as a single sentence within the generated >>>>>>>>>>>>>>> document, and as I am not a legal expert, I cannot determine >>>>>>>>>>>>>>> whether this constitutes a significant issue. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> The complexity increases when considering models trained on >>>>>>>>>>>>>>> different languages, which may translate content into English. >>>>>>>>>>>>>>> In such cases, a Google search would fail to detect the origin. >>>>>>>>>>>>>>> Is this still considered plagiarism? Does it violate copyright >>>>>>>>>>>>>>> laws? I am uncertain. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Similar challenges arise with code generation. For instance, if >>>>>>>>>>>>>>> a model is trained on a GPL-licensed Python library that >>>>>>>>>>>>>>> implements a novel data structure, and the model subsequently >>>>>>>>>>>>>>> rewrites this structure in Java, a Google search is unlikely to >>>>>>>>>>>>>>> identify the source. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Personally, I do not assume these models will avoid producing >>>>>>>>>>>>>>> copyrighted material. This doesn’t mean I am against AI at all, >>>>>>>>>>>>>>> but rather reflects my belief that the requirements set by >>>>>>>>>>>>>>> Apache are not easily “provable” in such scenarios. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> My personal opinion is that we should at least consider allow >>>>>>>>>>>>>>>> listing a few specific sources (any vendor that scans output >>>>>>>>>>>>>>>> for infringement) and add that to the PR template and in other >>>>>>>>>>>>>>>> locations (readme, web site). Bonus points if we can set up >>>>>>>>>>>>>>>> code scanning (useful for non-AI contributions!).
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> My perspective, after trying to see what AI can do, is the >>>>>>>>>>>>>>> following: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Strengths >>>>>>>>>>>>>>> * Generating a preliminary draft of a document and assisting >>>>>>>>>>>>>>> with iterative revisions >>>>>>>>>>>>>>> * Documenting individual methods >>>>>>>>>>>>>>> * Generation of “simple” methods and scripts, provided the >>>>>>>>>>>>>>> underlying libraries are well-documented in public repositories >>>>>>>>>>>>>>> * Managing repetitive or procedural tasks, such as “migrating >>>>>>>>>>>>>>> from X to Y” or “converting serializations to the X interface” >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Limitations >>>>>>>>>>>>>>> * Producing a fully functional document in a single attempt >>>>>>>>>>>>>>> that meets merge standards. When documenting Gens.java and >>>>>>>>>>>>>>> Property.java, the output appeared plausible but contained >>>>>>>>>>>>>>> frequent inaccuracies. >>>>>>>>>>>>>>> * Addressing complex or ambiguous scenarios (“gossip”), though >>>>>>>>>>>>>>> this challenge is not unique to AI—Matt Byrd and I tested >>>>>>>>>>>>>>> Claude for CASSANDRA-20659, where it could identify relevant >>>>>>>>>>>>>>> code but proposed solutions that risked corrupting production >>>>>>>>>>>>>>> clusters. >>>>>>>>>>>>>>> * Interpreting large-scale codebases. Beyond approximately 300 >>>>>>>>>>>>>>> lines of actual code (excluding formatting), performance >>>>>>>>>>>>>>> degrades significantly, leading to a marked decline in output >>>>>>>>>>>>>>> quality. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Note: When referring to AI/LLMs, I am not discussing >>>>>>>>>>>>>>> interactions with a user interface to execute specific tasks, >>>>>>>>>>>>>>> but rather leveraging code agents like Roo and Aider to provide >>>>>>>>>>>>>>> contextual information to the LLM. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Given these observations, it remains challenging to determine >>>>>>>>>>>>>>> optimal practices. In some contexts it's very clear that >>>>>>>>>>>>>>> nothing was taken from external work (e.g., “create a test >>>>>>>>>>>>>>> using our BTree class that inserts a row with a null column,” >>>>>>>>>>>>>>> “analyze this function’s purpose”). However, for substantial >>>>>>>>>>>>>>> tasks, the situation becomes more complex. If the author >>>>>>>>>>>>>>> employed AI as a collaborative tool during “pair programming,” >>>>>>>>>>>>>>> concerns are not really that different than Google searches >>>>>>>>>>>>>>> (unless the work involves unique elements like introducing new >>>>>>>>>>>>>>> data structures or indexes). Conversely, if the author “vibe >>>>>>>>>>>>>>> coded” the entire patch, two primary concerns arise: whether the >>>>>>>>>>>>>>> author has rights to the code and whether its quality aligns >>>>>>>>>>>>>>> with requirements. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> TL;DR - I am not against AI contributions, but strongly prefer >>>>>>>>>>>>>>> it's done as “pair programming”. My experience with “vibe >>>>>>>>>>>>>>> coding” makes me worry about the quality of the code, and that >>>>>>>>>>>>>>> the author is less likely to validate that the code generated >>>>>>>>>>>>>>> is safe to donate.
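A minimal sketch of the kind of temperature experiment described above, assuming the OpenAI Python client; the prompt and model name are placeholders, and the "check" is simply flagging long sentences for a manual web search rather than anything automated:

```python
# Rough sketch only: sample a model at several temperatures and surface
# sentences worth pasting into a search engine to look for verbatim sources.
# Assumes the OpenAI Python client; prompt and model name are placeholders.
import re
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

PROMPT = "Explain how Cassandra's gossip protocol propagates cluster state."

for temperature in (0.2, 0.5, 0.8):
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        temperature=temperature,
        messages=[{"role": "user", "content": PROMPT}],
    )
    text = response.choices[0].message.content
    # Long sentences are the most likely to be findable verbatim; short ones
    # are rarely distinctive enough to attribute to a single source.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if len(sentence.split()) >= 12:
            print(f"temp={temperature}: {sentence}")
```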
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> This email was generated with the help of AI =) >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On May 30, 2025, at 3:00 PM, Ariel Weisberg >>>>>>>>>>>>>>>> <ar...@weisberg.ws> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Hi all, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> It looks like we haven't discussed this much and haven't >>>>>>>>>>>>>>>> settled on a policy for what kinds of AI generated >>>>>>>>>>>>>>>> contributions we accept and what vetting is required for them. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> https://www.apache.org/legal/generative-tooling.html#:~:text=Given%20the%20above,code%20scanning%20results. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> ``` >>>>>>>>>>>>>>>> Given the above, code generated in whole or in part using AI >>>>>>>>>>>>>>>> can be contributed if the contributor ensures that: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> 1. The terms and conditions of the generative AI tool do not >>>>>>>>>>>>>>>> place any restrictions on use of the output that would be >>>>>>>>>>>>>>>> inconsistent with the Open Source Definition. >>>>>>>>>>>>>>>> 2. At least one of the following conditions is met: >>>>>>>>>>>>>>>> 2.1 The output is not copyrightable subject matter (and >>>>>>>>>>>>>>>> would not be even if produced by a human). >>>>>>>>>>>>>>>> 2.2 No third party materials are included in the output. >>>>>>>>>>>>>>>> 2.3 Any third party materials that are included in the >>>>>>>>>>>>>>>> output are being used with permission (e.g., under a >>>>>>>>>>>>>>>> compatible open-source license) of the third party copyright >>>>>>>>>>>>>>>> holders and in compliance with the applicable license terms. >>>>>>>>>>>>>>>> 3. A contributor obtains reasonable certainty that conditions >>>>>>>>>>>>>>>> 2.2 or 2.3 are met if the AI tool itself provides sufficient >>>>>>>>>>>>>>>> information about output that may be similar to training data, >>>>>>>>>>>>>>>> or from code scanning results. >>>>>>>>>>>>>>>> ``` >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> There is a lot to unpack there, but it seems like any one of 2 >>>>>>>>>>>>>>>> needs to be met, and 3 describes how 2.2 and 2.3 can be >>>>>>>>>>>>>>>> satisfied. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> 2.1 is tricky as we are not copyright lawyers, and 2.2 and 2.3 >>>>>>>>>>>>>>>> is a pretty high bar in that it's hard to know if you have met >>>>>>>>>>>>>>>> it. Do we have anyone in the community running any code >>>>>>>>>>>>>>>> scanning tools already? 
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Here is the JIRA for addition of the generative AI policy: >>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/LEGAL-631 >>>>>>>>>>>>>>>> Legal mailing list discussion of the policy: >>>>>>>>>>>>>>>> https://lists.apache.org/thread/vw3jf4726yrhovg39mcz1y89mx8j4t8s >>>>>>>>>>>>>>>> Legal mailing list discussion of compliant tools: >>>>>>>>>>>>>>>> https://lists.apache.org/thread/nzyl311q53xhpq99grf6l1h076lgzybr >>>>>>>>>>>>>>>> Legal mailing list discussion about how Open AI terms are not >>>>>>>>>>>>>>>> Apache compatible: >>>>>>>>>>>>>>>> https://lists.apache.org/thread/lcvxnpf39v22lc3f9t5fo07p19237d16 >>>>>>>>>>>>>>>> Hadoop mailing list message hinting that they accept >>>>>>>>>>>>>>>> contributions but ask which tool: >>>>>>>>>>>>>>>> https://lists.apache.org/thread/bgs8x1f9ovrjmhg6b450bz8bt7o43yxj >>>>>>>>>>>>>>>> Spark mailing list message where they have given up on >>>>>>>>>>>>>>>> stopping people: >>>>>>>>>>>>>>>> https://lists.apache.org/thread/h6621sxfxcnnpsoyr31x65z207kk80fr >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I didn't see other projects discussing and deciding how to >>>>>>>>>>>>>>>> handle these contributions, but I also didn't check that many >>>>>>>>>>>>>>>> of them, only Hadoop, Spark, Druid, and Pulsar. I also can't see >>>>>>>>>>>>>>>> their PMC mailing list. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I asked O3 to deep research what is done to avoid producing >>>>>>>>>>>>>>>> copyrighted code: >>>>>>>>>>>>>>>> https://chatgpt.com/share/683a2983-dd9c-8009-9a66-425012af840d >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> To summarize: training deduplicates the training data so the model is >>>>>>>>>>>>>>>> less likely to reproduce it verbatim, prompts and fine >>>>>>>>>>>>>>>> tuning encourage not reproducing things verbatim, the >>>>>>>>>>>>>>>> inference is biased to not pick the best option but some >>>>>>>>>>>>>>>> neighboring one, encouraging originality, and in some instances >>>>>>>>>>>>>>>> the output is checked to make sure it doesn't match the >>>>>>>>>>>>>>>> training data. So to some extent 2.2 is being done to >>>>>>>>>>>>>>>> different degrees depending on what product you are using. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> It's worth noting that scanning the output can be >>>>>>>>>>>>>>>> probabilistic in the case of, say, Anthropic, and they still >>>>>>>>>>>>>>>> recommend code scanning. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Quite notably, Anthropic indemnifies its enterprise users >>>>>>>>>>>>>>>> against copyright claims. It's not perfect, but it does >>>>>>>>>>>>>>>> mean they have an incentive to make sure there are fewer >>>>>>>>>>>>>>>> copyright claims. We could choose to be picky and only accept >>>>>>>>>>>>>>>> specific sources of LLM generated code based on perceived >>>>>>>>>>>>>>>> safety. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I think not producing copyrighted output from your training >>>>>>>>>>>>>>>> data is a technically feasible achievement for these vendors >>>>>>>>>>>>>>>> so I have a moderate level of trust they will succeed at it if >>>>>>>>>>>>>>>> they say they do it. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I could send a message to the legal list asking for >>>>>>>>>>>>>>>> clarification and a set of tools, but based on Roman's >>>>>>>>>>>>>>>> communication >>>>>>>>>>>>>>>> (https://lists.apache.org/thread/f6k93xx67pc33o0yhm24j3dpq0323gyd) >>>>>>>>>>>>>>>> I think this is kind of what we get.
It's on us to ensure the >>>>>>>>>>>>>>>> contributions are kosher either by code scanning or accepting >>>>>>>>>>>>>>>> that the LLM vendors are doing a good job at avoiding >>>>>>>>>>>>>>>> copyrighted output. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> My personal opinion is that we should at least consider allow >>>>>>>>>>>>>>>> listing a few specific sources (any vendor that scans output >>>>>>>>>>>>>>>> for infringement) and add that to the PR template and in other >>>>>>>>>>>>>>>> locations (readme, web site). Bonus points if we can set up >>>>>>>>>>>>>>>> code scanning (useful for non-AI contributions!). >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Regards, >>>>>>>>>>>>>>>> Ariel
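On the code scanning idea above, a minimal sketch of what a lightweight verbatim-overlap check could look like (a real scanning service does far more; the file paths, extension filter, and window size here are illustrative assumptions):

```python
# Minimal sketch of verbatim-overlap scanning, not a substitute for a real
# code scanning service. Paths, extension filter, and window size are assumptions.
import sys
from pathlib import Path

WINDOW = 8  # consecutive normalized lines that must match to flag overlap


def normalize(path: Path) -> list[str]:
    """Collapse whitespace, lowercase, and drop blank lines so trivial
    formatting differences do not hide verbatim copying."""
    lines = path.read_text(errors="ignore").splitlines()
    return [" ".join(line.split()).lower() for line in lines if line.strip()]


def windows(lines: list[str], n: int) -> set[tuple[str, ...]]:
    """All runs of n consecutive normalized lines."""
    return {tuple(lines[i:i + n]) for i in range(len(lines) - n + 1)}


def scan(contribution: Path, corpus_dir: Path) -> None:
    contributed = windows(normalize(contribution), WINDOW)
    for ref in corpus_dir.rglob("*.java"):
        overlap = contributed & windows(normalize(ref), WINDOW)
        if overlap:
            print(f"{contribution}: {len(overlap)} overlapping "
                  f"{WINDOW}-line windows also appear in {ref}")


if __name__ == "__main__":
    # Hypothetical usage: python scan.py NewFile.java path/to/reference/corpus
    scan(Path(sys.argv[1]), Path(sys.argv[2]))
```

A line-level check like this would miss rewording or translation between languages (the concern raised earlier in the thread), which is part of why vendor-side output filtering and indemnification still matter.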