> These change very often and it's a constant moving target. This is not hyperbole. This area is moving faster than anything I've seen before.
> I am in this camp at the moment, AI vs Human has the same problem for the > reviewer; we are supposed to be doing this, and blocking AI or putting new > rules around AI doesn't really change anything, we are still supposed to do > this work. +1. > I am personally in the stance that disclosure (which is the ASF policy) is > best for the time being; nothing in this thread has motivated me to change > the current policy. Yep. Option 2 - guidance and disclosure makes the most sense to me after reading this thread. On Tue, Jun 24, 2025, at 5:09 PM, David Capwell wrote: > > It's not clear from that thread precisely what they are objecting to and > > whether it has changed (another challenge!) > > That thread was last updated in 2023 and the current stance is just "tell > people which one you used, and make sure the output follows the 3 main > points". > > > We can make a best effort to just vet the ones that people actually want to > > widely use and refuse everything else and be better off than allowing > > people to use tools that are known not to be license compatible or make > > little/no effort to avoid reproducing large amounts of copyrighted code. > > How often are we going to "vet" new tools? These change very often and it's > a constant moving target. Are we going to expect someone to do this vetting, > give the pros/cons of what has changed since the last vote, then revote every > 6 months? What does "vet" even mean? > > > allowing people to use tools that are known not to be license compatible > > Which tools are you referring to? The major providers all document that the > output is owned by the entity that requested it. > > > make little/no effort to avoid reproducing large amounts of copyrighted > > code. > > How do you go about qualifying that? Which tools / services are you > referring to? How do you go about evaluating them? > > > If someone submits copyrighted code to the project, whether an AI generated > > it or they just grabbed it from a Google search, it’s on the project to try > > not to accept it. > > I am in this camp at the moment, AI vs Human has the same problem for the > reviewer; we are supposed to be doing this, and blocking AI or putting new > rules around AI doesn't really change anything, we are still supposed to do > this work. > > > What would you want? > > My vote would be on 2/3 given the list from Ariel. But I am personally in > the stance that disclosure (which is the ASF policy) is best for the time > being; nothing in this thread has motivated me to change the current policy. > > On Mon, Jun 16, 2025 at 4:21 PM Patrick McFadin <pmcfa...@gmail.com> wrote: >> I'm on with the allow list(1) or option 2. 3 just isn't realistic anymore. >> >> Patrick >> >> >> >> On Mon, Jun 16, 2025 at 3:09 PM Caleb Rackliffe <calebrackli...@gmail.com> >> wrote: >>> I haven't participated much here, but my vote would be basically #1, i.e. >>> an "allow list" with a clear procedure for expansion. >>> >>> On Mon, Jun 16, 2025 at 4:05 PM Ariel Weisberg <ar...@weisberg.ws> wrote: >>>> __ >>>> Hi, >>>> >>>> We could, but if the allow list is binding then it's still an allow list >>>> with some guidance on how to expand the allow list. >>>> >>>> If it isn't binding then it's guidance so still option 2 really. >>>> >>>> I think the key distinction to find some early consensus on is whether we do a >>>> binding allow list or guidance; then we can iron out the guidance, which >>>> I think will be less controversial to work out.
>>>> >>>> Or option 3 which is not accepting AI generated contributions. I think >>>> there are some with healthy skepticism of AI generated code, but so far I >>>> haven't met anyone who wants to forbid it entirely. >>>> >>>> Ariel >>>> >>>> On Mon, Jun 16, 2025, at 4:54 PM, Josh McKenzie wrote: >>>>> Couldn't our official stance be a combination of 1 and 2? i.e. "Here's an >>>>> allow list. If you're using something not on that allow list, here's some >>>>> basic guidance and maybe let us know how you tried to mitigate some of >>>>> this risk so we can update our allow list w/some nuance". >>>>> >>>>> On Mon, Jun 16, 2025, at 4:39 PM, Ariel Weisberg wrote: >>>>>> Hi, >>>>>> >>>>>> On Wed, Jun 11, 2025, at 3:48 PM, Jeremiah Jordan wrote: >>>>>>> Where are you getting this from? From the OpenAI terms of use: >>>>>>> https://openai.com/policies/terms-of-use/ >>>>>> Direct from the ASF legal mailing list discussion I linked to in my >>>>>> original email calling this out >>>>>> https://lists.apache.org/thread/lcvxnpf39v22lc3f9t5fo07p19237d16 It's >>>>>> not clear from that thread precisely what they are objecting to and >>>>>> whether it has changed (another challenge!), but I believe it's >>>>>> restrictions on what you are allowed to do with the output of OpenAI >>>>>> models. And if you get the output via other services it's under a >>>>>> different license and it's fine! >>>>>> >>>>>> Already we are demonstrating that it is not trivial to understand what is >>>>>> and isn't allowed >>>>>> >>>>>> On Wed, Jun 11, 2025, at 3:48 PM, Jeremiah Jordan wrote: >>>>>>> I still maintain that trying to publish an exhaustive list of >>>>>>> acceptable tools does not seem reasonable. >>>>>>> But I agree that giving people guidance is possible. Maybe having a >>>>>>> statement in the contribution guidelines along the lines of: >>>>>> The list doesn't need to be exhaustive. We are not required to accept AI >>>>>> generated code at all! >>>>>> >>>>>> We can make a best effort to just vet the ones that people actually want >>>>>> to widely use and refuse everything else and be better off than allowing >>>>>> people to use tools that are known not to be license compatible or make >>>>>> little/no effort to avoid reproducing large amounts of copyrighted code. >>>>>> >>>>>>> “Make sure your tools do X, here are some that at the time of being >>>>>>> added to this list did X, Tool A, Tool B … >>>>>>> Here is a list of tools that at the time of being added to this list >>>>>>> did not satisfy X. Tool Z - reason why” >>>>>> I would be fine with this as an outcome. If we voted with multiple >>>>>> options it wouldn't be my first choice. >>>>>> >>>>>> This thread only has 4 participants so far so it's hard to get a signal >>>>>> on what people would want if we tried to vote. >>>>>> >>>>>> David, Scott, anyone else, if the options were: >>>>>> 1. Allow list >>>>>> 2. Basic guidance as suggested by Jeremiah, but primarily leave it up >>>>>> to contributor/reviewer >>>>>> 3. Do nothing >>>>>> 4. My choice isn't here >>>>>> What would you want? >>>>>> >>>>>> My vote in choice order is 1,2,3. >>>>>> >>>>>> Ariel >>>>>> >>>>>> >>>>>> On Wed, Jun 11, 2025, at 3:48 PM, Jeremiah Jordan wrote: >>>>>>> >>>>>>>> I respectfully mean that contributors, reviewers, and committers >>>>>>>> can't feasibly understand and enforce the ASF guidelines. >>>>>>> If this is true, then the ASF is in a lot of trouble and you should >>>>>>> bring it up with the ASF board. >>>>>>> Where are you getting this from?
From the OpenAI terms of use: >>>>>>> https://openai.com/policies/terms-of-use/ >>>>>>> >>>>>>>> We don't even necessarily need to be that restrictive beyond requiring >>>>>>>> tools that make at least some effort not to reproduce large amounts of >>>>>>>> copyrighted code that may/may not be license compatible or tools that >>>>>>>> are themselves not license compatible. This ends up encompassing most >>>>>>>> of the ones people want to use anyways. >>>>>>> >>>>>>>> There is a non-zero amount we can do to educate and guide that would >>>>>>>> be better than pointing people to the ASF guidelines and leaving it at >>>>>>>> that. >>>>>>> >>>>>>> I still maintain that trying to publish an exhaustive list of >>>>>>> acceptable tools does not seem reasonable. >>>>>>> But I agree that giving people guidance is possible. Maybe having a >>>>>>> statement in the contribution guidelines along the lines of: >>>>>>> “Make sure your tools do X, here are some that at the time of being >>>>>>> added to this list did X, Tool A, Tool B … >>>>>>> Here is a list of tools that at the time of being added to this list >>>>>>> did not satisfy X. Tool Z - reason why” >>>>>>> >>>>>>> -Jeremiah >>>>>>> >>>>>>> >>>>>>> On Jun 11, 2025 at 11:48:30 AM, Ariel Weisberg <ar...@weisberg.ws> >>>>>>> wrote: >>>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> I am not saying you said it, but I respectfully mean that >>>>>>>> contributors, reviewers, and committers can't feasibly understand and >>>>>>>> enforce the ASF guidelines. We would be another link in a chain of >>>>>>>> people abdicating responsibility starting with LLM vendors serving up >>>>>>>> models that reproduce copyrighted code, then going to ASF legal which >>>>>>>> gives us guidelines without the tools to enforce those guidelines, and >>>>>>>> now we (the PMC) would be doing the same to contributors, reviewers, >>>>>>>> and committers. >>>>>>>> >>>>>>>>> I don’t think anyone is going to be able to maintain and enforce a >>>>>>>>> list of acceptable tools for contributors to the project to stick to. >>>>>>>>> We can’t know what someone did on their laptop, all we can do is >>>>>>>>> evaluate the code they submit. >>>>>>>> I agree we might not be able to do a perfect job at any aspect of >>>>>>>> trying to make sure that the code we accept is not problematic in some >>>>>>>> way, but that doesn't mean we shouldn't try? >>>>>>>> >>>>>>>> We don't even necessarily need to be that restrictive beyond requiring >>>>>>>> tools that make at least some effort not to reproduce large amounts of >>>>>>>> copyrighted code that may/may not be license compatible or tools that >>>>>>>> are themselves not license compatible. This ends up encompassing most >>>>>>>> of the ones people want to use anyways. >>>>>>>> >>>>>>>> How many people are aware that if you get code from OpenAI directly >>>>>>>> that the license isn't ASL compatible, but that if you get it via >>>>>>>> Microsoft services that use OpenAI models that it's ASL compatible? >>>>>>>> It's not in the ASF guidelines (it was but they removed it!). >>>>>>>> >>>>>>>> How many people are aware that when people use locally run models >>>>>>>> there is no output filtering, further increasing the odds of the model >>>>>>>> reproducing copyright encumbered code? >>>>>>>> >>>>>>>> There is a non-zero amount we can do to educate and guide that would >>>>>>>> be better than pointing people to the ASF guidelines and leaving it at >>>>>>>> that.
>>>>>>>> >>>>>>>> The ASF guidelines themselves have suggestions like requiring people >>>>>>>> to say if they used AI and then which AI. I don't think it's very >>>>>>>> useful beyond checking license compatibility of the AI itself, but >>>>>>>> that is something we should be doing so it might as well be documented >>>>>>>> and included in the PR text. >>>>>>>> >>>>>>>> Ariel >>>>>>>> >>>>>>>> On Mon, Jun 2, 2025, at 7:54 PM, Jeremiah Jordan wrote: >>>>>>>>> I don’t think I said we should abdicate responsibility? I said the >>>>>>>>> key point is that contributors, and more importantly reviewers and >>>>>>>>> committers understand the ASF guidelines and hold all code to those >>>>>>>>> standards. Any suspect code should be blocked during review. As Roman >>>>>>>>> says in your quote, this isn’t about AI, it’s about copyright. If >>>>>>>>> someone submits copyrighted code to the project, whether an AI >>>>>>>>> generated it or they just grabbed it from a Google search, it’s on >>>>>>>>> the project to try not to accept it. >>>>>>>>> I don’t think anyone is going to be able to maintain and enforce a >>>>>>>>> list of acceptable tools for contributors to the project to stick to. >>>>>>>>> We can’t know what someone did on their laptop, all we can do is >>>>>>>>> evaluate the code they submit. >>>>>>>>> >>>>>>>>> -Jeremiah >>>>>>>>> >>>>>>>>> On Mon, Jun 2, 2025 at 6:39 PM Ariel Weisberg <ar...@weisberg.ws> >>>>>>>>> wrote: >>>>>>>>>> __ >>>>>>>>>> Hi, >>>>>>>>>> >>>>>>>>>> As PMC members/committers we aren't supposed to abdicate this to >>>>>>>>>> legal or to contributors. Despite the fact that we aren't equipped >>>>>>>>>> to solve this problem we are supposed to be making sure that code >>>>>>>>>> contributed is non-infringing. >>>>>>>>>> >>>>>>>>>> This is a quotation from Roman Shaposhnik from this legal thread >>>>>>>>>> https://lists.apache.org/thread/f6k93xx67pc33o0yhm24j3dpq0323gyd >>>>>>>>>> >>>>>>>>>>> Yes, because you have to. Again -- forget about AI -- if a drive-by >>>>>>>>>>> contributor submits a patch that has huge amounts of code stolen >>>>>>>>>>> from some existing copyright holder -- it is very much ON YOU as a >>>>>>>>>>> committer/PMC to prevent that from happening. >>>>>>>>>> >>>>>>>>>> We aren't supposed to knowingly allow people to use AI tools that >>>>>>>>>> are known to generate infringing contributions or contributions >>>>>>>>>> which are not license compatible (such as OpenAI terms of use). >>>>>>>>>> >>>>>>>>>> Ariel >>>>>>>>>> On Mon, Jun 2, 2025, at 7:08 PM, Jeremiah Jordan wrote: >>>>>>>>>>> > Ultimately it's the contributor's (and committer's) job to ensure >>>>>>>>>>> > that their contributions meet the bar for acceptance >>>>>>>>>>> To me this is the key point. Given how pervasive this stuff is >>>>>>>>>>> becoming, I don’t think it’s feasible to make some list of tools >>>>>>>>>>> and enforce it. Even without getting into extra tools, IDEs >>>>>>>>>>> (including IntelliJ) are doing more and more LLM based code >>>>>>>>>>> suggestion as time goes on. >>>>>>>>>>> I think we should point people to the ASF Guidelines around such >>>>>>>>>>> tools, and the guidelines around copyrighted code, and then >>>>>>>>>>> continue to review patches with the high standards we have always >>>>>>>>>>> had in this project. 
>>>>>>>>>>> >>>>>>>>>>> -Jeremiah >>>>>>>>>>> >>>>>>>>>>> On Mon, Jun 2, 2025 at 5:52 PM Ariel Weisberg <ar...@weisberg.ws> >>>>>>>>>>> wrote: >>>>>>>>>>>> __ >>>>>>>>>>>> Hi, >>>>>>>>>>>> >>>>>>>>>>>> To clarify, are you saying that we should not accept AI generated >>>>>>>>>>>> code until it has been looked at by a human and then written again >>>>>>>>>>>> with different "wording" to ensure that it doesn't directly copy >>>>>>>>>>>> anything? >>>>>>>>>>>> >>>>>>>>>>>> Or do you mean something else about the quality of "vibe coding" >>>>>>>>>>>> and how we shouldn't allow it because it makes bad code? >>>>>>>>>>>> Ultimately it's the contributor's (and committer's) job to ensure >>>>>>>>>>>> that their contributions meet the bar for acceptance and I don't >>>>>>>>>>>> think we should tell them how to go about meeting that bar beyond >>>>>>>>>>>> what is needed to address the copyright concern. >>>>>>>>>>>> >>>>>>>>>>>> I agree that the bar set by the Apache guidelines is pretty high. >>>>>>>>>>>> They are simultaneously impossible and trivial to meet depending >>>>>>>>>>>> on how you interpret them, and we are not very well equipped to >>>>>>>>>>>> interpret them. >>>>>>>>>>>> >>>>>>>>>>>> It would have been more straightforward for them to simply say no, >>>>>>>>>>>> but they didn't opt to do that, as if there is some way for PMCs to >>>>>>>>>>>> acceptably take AI generated contributions. >>>>>>>>>>>> >>>>>>>>>>>> Ariel >>>>>>>>>>>> >>>>>>>>>>>> On Mon, Jun 2, 2025, at 5:03 PM, David Capwell wrote: >>>>>>>>>>>>>> fine tuning encourage not reproducing things verbatim >>>>>>>>>>>>>> I think not >>>>>>>>>>>>>> producing copyrighted output from your training data is a >>>>>>>>>>>>>> technically feasible achievement for these vendors so I have a >>>>>>>>>>>>>> moderate level of trust they will succeed at it if they say they >>>>>>>>>>>>>> do it. >>>>>>>>>>>>> >>>>>>>>>>>>> Some team members and I discussed this in the context of my >>>>>>>>>>>>> documentation patch (which utilized Claude during composition). I >>>>>>>>>>>>> conducted an experiment to pose high-level Cassandra-related >>>>>>>>>>>>> questions to a model without additional context, while adjusting >>>>>>>>>>>>> the temperature parameter (tested at 0.2, 0.5, and 0.8). The >>>>>>>>>>>>> results revealed that each test generated content copied verbatim >>>>>>>>>>>>> from a specific non-Apache (and non-DSE) website. I did not >>>>>>>>>>>>> verify whether this content was copyrighted, though it was easily >>>>>>>>>>>>> identifiable through a simple Google search. This occurred as a >>>>>>>>>>>>> single sentence within the generated document, and as I am not a >>>>>>>>>>>>> legal expert, I cannot determine whether this constitutes a >>>>>>>>>>>>> significant issue. >>>>>>>>>>>>> >>>>>>>>>>>>> The complexity increases when considering models trained on >>>>>>>>>>>>> different languages, which may translate content into English. In >>>>>>>>>>>>> such cases, a Google search would fail to detect the origin. Is >>>>>>>>>>>>> this still considered plagiarism? Does it violate copyright laws? >>>>>>>>>>>>> I am uncertain. >>>>>>>>>>>>> >>>>>>>>>>>>> Similar challenges arise with code generation. For instance, if a >>>>>>>>>>>>> model is trained on a GPL-licensed Python library that implements >>>>>>>>>>>>> a novel data structure, and the model subsequently rewrites this >>>>>>>>>>>>> structure in Java, a Google search is unlikely to identify the >>>>>>>>>>>>> source.
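To make the temperature experiment described above concrete, here is a minimal sketch assuming the Anthropic Python SDK and an ANTHROPIC_API_KEY in the environment; the model name, prompt, and output handling are illustrative stand-ins, not the setup actually used for the documentation patch:

```
# Illustrative only: sample the same high-level question at several temperatures
# and save the outputs so distinctive sentences can be searched for verbatim matches.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

PROMPT = "Explain how Cassandra's compaction strategies differ."  # hypothetical prompt

for temperature in (0.2, 0.5, 0.8):
    message = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder model name
        max_tokens=1024,
        temperature=temperature,
        messages=[{"role": "user", "content": PROMPT}],
    )
    text = "".join(block.text for block in message.content if block.type == "text")
    path = f"sample_t{temperature}.txt"
    with open(path, "w") as f:
        f.write(text)
    # Print the longest sentence as a candidate string to paste into a search engine.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    print(path, "->", max(sentences, key=len)[:120])
```

Pasting the printed sentence into a search engine is roughly the manual check described above; it obviously cannot catch translated or reworded reproductions, which is the harder case raised in the next paragraph.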
>>>>>>>>>>>> >>>>>>>>>>>>> Personally, I do not assume these models will avoid producing >>>>>>>>>>>>> copyrighted material. This doesn’t mean I am against AI at all, >>>>>>>>>>>>> but rather reflects my belief that the requirements set by Apache >>>>>>>>>>>>> are not easily “provable” in such scenarios. >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>> My personal opinion is that we should at least consider allow >>>>>>>>>>>>>> listing a few specific sources (any vendor that scans output for >>>>>>>>>>>>>> infringement) and add that to the PR template and in other >>>>>>>>>>>>>> locations (readme, web site). Bonus points if we can set up code >>>>>>>>>>>>>> scanning (useful for non-AI contributions!). >>>>>>>>>>>>> >>>>>>>>>>>>> My perspective, after trying to see what AI can do, is the >>>>>>>>>>>>> following: >>>>>>>>>>>>> >>>>>>>>>>>>> Strengths >>>>>>>>>>>>> * Generating a preliminary draft of a document and assisting with >>>>>>>>>>>>> iterative revisions >>>>>>>>>>>>> * Documenting individual methods >>>>>>>>>>>>> * Generation of “simple” methods and scripts, provided the >>>>>>>>>>>>> underlying libraries are well-documented in public repositories >>>>>>>>>>>>> * Managing repetitive or procedural tasks, such as “migrating >>>>>>>>>>>>> from X to Y” or “converting serializations to the X interface” >>>>>>>>>>>>> >>>>>>>>>>>>> Limitations >>>>>>>>>>>>> * Producing a fully functional document in a single attempt that >>>>>>>>>>>>> meets merge standards. When documenting Gens.java and >>>>>>>>>>>>> Property.java, the output appeared plausible but contained >>>>>>>>>>>>> frequent inaccuracies. >>>>>>>>>>>>> * Addressing complex or ambiguous scenarios (“gossip”), though >>>>>>>>>>>>> this challenge is not unique to AI—Matt Byrd and I tested Claude >>>>>>>>>>>>> for CASSANDRA-20659, where it could identify relevant code but >>>>>>>>>>>>> proposed solutions that risked corrupting production clusters. >>>>>>>>>>>>> * Interpreting large-scale codebases. Beyond approximately 300 >>>>>>>>>>>>> lines of actual code (excluding formatting), performance degrades >>>>>>>>>>>>> significantly, leading to a marked decline in output quality. >>>>>>>>>>>>> >>>>>>>>>>>>> Note: When referring to AI/LLMs, I am not discussing interactions >>>>>>>>>>>>> with a user interface to execute specific tasks, but rather >>>>>>>>>>>>> leveraging code agents like Roo and Aider to provide contextual >>>>>>>>>>>>> information to the LLM. >>>>>>>>>>>>> >>>>>>>>>>>>> Given these observations, it remains challenging to determine >>>>>>>>>>>>> optimal practices. In some contexts it's very clear that >>>>>>>>>>>>> nothing was taken from external work (e.g., “create a test using >>>>>>>>>>>>> our BTree class that inserts a row with a null column,” “analyze >>>>>>>>>>>>> this function’s purpose”). However, for substantial tasks, the >>>>>>>>>>>>> situation becomes more complex. If the author employed AI as a >>>>>>>>>>>>> collaborative tool during “pair programming,” concerns are not >>>>>>>>>>>>> really that different from Google searches (unless the work >>>>>>>>>>>>> involves unique elements like introducing new data structures or >>>>>>>>>>>>> indexes). Conversely, if the author “vibe coded” the entire >>>>>>>>>>>>> patch, two primary concerns arise: whether the author has rights to >>>>>>>>>>>>> the code and whether its quality aligns with requirements. >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> TL;DR - I am not against AI contributions, but strongly prefer >>>>>>>>>>>>> it's done as “pair programming”.
My experience with “vibe coding” >>>>>>>>>>>>> makes me worry about the quality of the code, and that the author >>>>>>>>>>>>> is less likely to validate that the code generated is safe to >>>>>>>>>>>>> donate. >>>>>>>>>>>>> >>>>>>>>>>>>> This email was generated with the help of AI =) >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>> On May 30, 2025, at 3:00 PM, Ariel Weisberg <ar...@weisberg.ws> >>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> Hi all, >>>>>>>>>>>>>> >>>>>>>>>>>>>> It looks like we haven't discussed this much and haven't settled >>>>>>>>>>>>>> on a policy for what kinds of AI generated contributions we >>>>>>>>>>>>>> accept and what vetting is required for them. >>>>>>>>>>>>>> >>>>>>>>>>>>>> https://www.apache.org/legal/generative-tooling.html#:~:text=Given%20the%20above,code%20scanning%20results. >>>>>>>>>>>>>> >>>>>>>>>>>>>> ``` >>>>>>>>>>>>>> Given the above, code generated in whole or in part using AI can >>>>>>>>>>>>>> be contributed if the contributor ensures that: >>>>>>>>>>>>>> >>>>>>>>>>>>>> 1. The terms and conditions of the generative AI tool do not >>>>>>>>>>>>>> place any restrictions on use of the output that would be >>>>>>>>>>>>>> inconsistent with the Open Source Definition. >>>>>>>>>>>>>> 2. At least one of the following conditions is met: >>>>>>>>>>>>>> 2.1 The output is not copyrightable subject matter (and would >>>>>>>>>>>>>> not be even if produced by a human). >>>>>>>>>>>>>> 2.2 No third party materials are included in the output. >>>>>>>>>>>>>> 2.3 Any third party materials that are included in the output >>>>>>>>>>>>>> are being used with permission (e.g., under a compatible >>>>>>>>>>>>>> open-source license) of the third party copyright holders and in >>>>>>>>>>>>>> compliance with the applicable license terms. >>>>>>>>>>>>>> 3. A contributor obtains reasonable certainty that conditions >>>>>>>>>>>>>> 2.2 or 2.3 are met if the AI tool itself provides sufficient >>>>>>>>>>>>>> information about output that may be similar to training data, >>>>>>>>>>>>>> or from code scanning results. >>>>>>>>>>>>>> ``` >>>>>>>>>>>>>> >>>>>>>>>>>>>> There is a lot to unpack there, but it seems like any one of the >>>>>>>>>>>>>> conditions under 2 needs to be met, and 3 describes how 2.2 and 2.3 >>>>>>>>>>>>>> can be satisfied. >>>>>>>>>>>>>> >>>>>>>>>>>>>> 2.1 is tricky as we are not copyright lawyers, and 2.2 and 2.3 >>>>>>>>>>>>>> are a pretty high bar in that it's hard to know if you have met >>>>>>>>>>>>>> them. Do we have anyone in the community running any code scanning >>>>>>>>>>>>>> tools already?
>>>>>>>>>>>>>> >>>>>>>>>>>>>> Here is the JIRA for addition of the generative AI policy: >>>>>>>>>>>>>> https://issues.apache.org/jira/browse/LEGAL-631 >>>>>>>>>>>>>> Legal mailing list discussion of the policy: >>>>>>>>>>>>>> https://lists.apache.org/thread/vw3jf4726yrhovg39mcz1y89mx8j4t8s >>>>>>>>>>>>>> Legal mailing list discussion of compliant tools: >>>>>>>>>>>>>> https://lists.apache.org/thread/nzyl311q53xhpq99grf6l1h076lgzybr >>>>>>>>>>>>>> Legal mailing list discussion about how Open AI terms are not >>>>>>>>>>>>>> Apache compatible: >>>>>>>>>>>>>> https://lists.apache.org/thread/lcvxnpf39v22lc3f9t5fo07p19237d16 >>>>>>>>>>>>>> Hadoop mailing list message hinting that they accept >>>>>>>>>>>>>> contributions but ask which tool: >>>>>>>>>>>>>> https://lists.apache.org/thread/bgs8x1f9ovrjmhg6b450bz8bt7o43yxj >>>>>>>>>>>>>> Spark mailing list message where they have given up on stopping >>>>>>>>>>>>>> people: >>>>>>>>>>>>>> https://lists.apache.org/thread/h6621sxfxcnnpsoyr31x65z207kk80fr >>>>>>>>>>>>>> >>>>>>>>>>>>>> I didn't see other projects discussing and deciding how to >>>>>>>>>>>>>> handle these contributions, but I also didn't check that many of >>>>>>>>>>>>>> them, only Hadoop, Spark, Druid, and Pulsar. I also can't see their >>>>>>>>>>>>>> PMC mailing list. >>>>>>>>>>>>>> >>>>>>>>>>>>>> I asked O3 to deep research what is done to avoid producing >>>>>>>>>>>>>> copyrighted code: >>>>>>>>>>>>>> https://chatgpt.com/share/683a2983-dd9c-8009-9a66-425012af840d >>>>>>>>>>>>>> >>>>>>>>>>>>>> To summarize, training deduplicates the training data so the model is >>>>>>>>>>>>>> less likely to reproduce it verbatim, prompts and fine tuning >>>>>>>>>>>>>> encourage not reproducing things verbatim, the inference is >>>>>>>>>>>>>> biased to not pick the best option but some neighboring one, >>>>>>>>>>>>>> encouraging originality, and in some instances the output is >>>>>>>>>>>>>> checked to make sure it doesn't match the training data. So to >>>>>>>>>>>>>> some extent 2.2 is being done to different degrees depending on >>>>>>>>>>>>>> what product you are using. >>>>>>>>>>>>>> >>>>>>>>>>>>>> It's worth noting that scanning the output can be probabilistic >>>>>>>>>>>>>> in the case of, say, Anthropic, and they still recommend code >>>>>>>>>>>>>> scanning. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Quite notably, Anthropic indemnifies its enterprise users >>>>>>>>>>>>>> against copyright claims. It's not perfect, but it does >>>>>>>>>>>>>> mean they have an incentive to make sure there are fewer >>>>>>>>>>>>>> copyright claims. We could choose to be picky and only accept >>>>>>>>>>>>>> specific sources of LLM generated code based on perceived safety. >>>>>>>>>>>>>> >>>>>>>>>>>>>> I think not producing copyrighted output from your training data >>>>>>>>>>>>>> is a technically feasible achievement for these vendors so I >>>>>>>>>>>>>> have a moderate level of trust they will succeed at it if they >>>>>>>>>>>>>> say they do it. >>>>>>>>>>>>>> >>>>>>>>>>>>>> I could send a message to the legal list asking for >>>>>>>>>>>>>> clarification and a set of tools, but based on Roman's >>>>>>>>>>>>>> communication >>>>>>>>>>>>>> (https://lists.apache.org/thread/f6k93xx67pc33o0yhm24j3dpq0323gyd) >>>>>>>>>>>>>> I think this is kind of what we get. It's on us to ensure the >>>>>>>>>>>>>> contributions are kosher either by code scanning or accepting >>>>>>>>>>>>>> that the LLM vendors are doing a good job at avoiding >>>>>>>>>>>>>> copyrighted output.
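The "code scanning" mentioned above could mean anything from a commercial scanning service to a crude local check. As a rough illustration of the idea only (not a substitute for a real scanner), a stdlib-only sketch that flags exact line-run overlaps between a patch's added lines and a local corpus of third-party code; the window size, corpus layout, and file extension are arbitrary assumptions:

```
# Toy verbatim-overlap check: hash every run of WINDOW consecutive normalized lines
# in a local corpus of third-party code, then flag any added patch line-run that
# matches. Real scanning services are far more sophisticated; this only catches
# exact line-level copies.
import hashlib
import sys
from pathlib import Path

WINDOW = 8  # arbitrary: consecutive non-blank lines that must match to be flagged

def shingles(lines):
    lines = [ln.strip() for ln in lines if ln.strip()]
    for i in range(len(lines) - WINDOW + 1):
        window = "\n".join(lines[i:i + WINDOW])
        yield hashlib.sha256(window.encode()).hexdigest(), lines[i]

def corpus_hashes(corpus_dir):
    hashes = {}
    for path in Path(corpus_dir).rglob("*.java"):  # assumed corpus of .java files
        for digest, _ in shingles(path.read_text(errors="ignore").splitlines()):
            hashes.setdefault(digest, path)
    return hashes

def added_lines(patch_path):
    # Lines added by a unified diff, excluding the "+++" file headers.
    for ln in Path(patch_path).read_text(errors="ignore").splitlines():
        if ln.startswith("+") and not ln.startswith("+++"):
            yield ln[1:]

if __name__ == "__main__":
    corpus, patch = sys.argv[1], sys.argv[2]  # e.g. ./third_party_corpus my_change.patch
    known = corpus_hashes(corpus)
    for digest, first_line in shingles(list(added_lines(patch))):
        if digest in known:
            print(f"possible verbatim overlap with {known[digest]}: {first_line!r} ...")
```

Anything this naive misses rewording, translation, or reformatting, which is exactly the gap the thread worries about; it only shows the shape of the check.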
>>>>>>>>>>>>>> >>>>>>>>>>>>>> My personal opinion is that we should at least consider allow >>>>>>>>>>>>>> listing a few specific sources (any vendor that scans output for >>>>>>>>>>>>>> infringement) and add that to the PR template and in other >>>>>>>>>>>>>> locations (readme, web site). Bonus points if we can set up code >>>>>>>>>>>>>> scanning (useful for non-AI contributions!). >>>>>>>>>>>>>> >>>>>>>>>>>>>> Regards, >>>>>>>>>>>>>> Ariel >>>>>>>>>>>> >>>>>>>>>> >>>>>>>> >>>>>> >>>>> >>>>
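If the PR-template idea above is adopted, the disclosure could also be checked mechanically in CI. A hypothetical sketch, assuming the PR body is exported to a file and that the template gains a "Generative AI" section answered with either "none" or "tool: <name>"; none of these field names or answers have been agreed on:

```
# Hypothetical check that a PR description answers the generative-AI disclosure
# question proposed in this thread. The section heading and accepted answers are
# placeholders; nothing here reflects an agreed-on template.
import re
import sys

REQUIRED_HEADING = "## Generative AI"  # placeholder heading
ANSWER_PATTERN = re.compile(r"(none|tool:\s*\S+)", re.IGNORECASE)  # e.g. "none" or "tool: Claude Code"

def check(pr_body: str) -> bool:
    if REQUIRED_HEADING not in pr_body:
        print("missing generative-AI disclosure section")
        return False
    section = pr_body.split(REQUIRED_HEADING, 1)[1]
    if not ANSWER_PATTERN.search(section):
        print("disclosure section present but no tool (or 'none') declared")
        return False
    return True

if __name__ == "__main__":
    body = open(sys.argv[1]).read()  # e.g. the PR body exported by CI to a file
    sys.exit(0 if check(body) else 1)
```

A check like this only enforces the disclosure step of the ASF guidance quoted earlier; it says nothing about whether the declared tool is license compatible, which is the harder question the thread is trying to settle.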