I haven't participated much here, but my vote would be basically #1, i.e. an "allow list" with a clear procedure for expansion.
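
To make that a bit more concrete: the PR-template disclosure Ariel mentions further down could be as small as the snippet below. This is only a sketch of the idea, not proposed policy wording, and the tool names would come from whatever allow list we settle on.

```
### Generative AI disclosure
- [ ] No generative AI tools were used for this contribution, OR
- [ ] The tool(s) used are on the project's allow list: <tool name(s) and version(s)>
- [ ] I have reviewed the generated code and believe it does not reproduce third-party
      code in a way that is incompatible with the Apache License
      (see https://www.apache.org/legal/generative-tooling.html)
```

If a tool isn't on the allow list yet, the same section could point at the procedure for proposing an addition.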

On Mon, Jun 16, 2025 at 4:05 PM Ariel Weisberg <ar...@weisberg.ws> wrote:

> Hi,
>
> We could, but if the allow list is binding then it's still an allow list with some guidance on how to expand the allow list.
>
> If it isn't binding then it's guidance, so still option 2 really.
>
> I think the key distinction to find some early consensus on is whether we do a binding allow list or guidance, and then we can iron out the guidance, but I think that will be less controversial to work out.
>
> Or option 3, which is not accepting AI generated contributions. I think there are some with healthy skepticism of AI generated code, but so far I haven't met anyone who wants to forbid it entirely.
>
> Ariel
>
> On Mon, Jun 16, 2025, at 4:54 PM, Josh McKenzie wrote:
>
> Couldn't our official stance be a combination of 1 and 2? i.e. "Here's an allow list. If you're using something not on that allow list, here's some basic guidance and maybe let us know how you tried to mitigate some of this risk so we can update our allow list w/some nuance".
>
> On Mon, Jun 16, 2025, at 4:39 PM, Ariel Weisberg wrote:
>
> Hi,
>
> On Wed, Jun 11, 2025, at 3:48 PM, Jeremiah Jordan wrote:
>
> Where are you getting this from? From the OpenAI terms of use: https://openai.com/policies/terms-of-use/
>
> Direct from the ASF legal mailing list discussion I linked to in my original email calling this out: https://lists.apache.org/thread/lcvxnpf39v22lc3f9t5fo07p19237d16. It's not clear from that thread precisely what they are objecting to and whether it has changed (another challenge!), but I believe it's restrictions on what you are allowed to do with the output of OpenAI models. And if you get the output via other services it's under a different license and it's fine!
>
> Already we are demonstrating that it is not trivial to understand what is and isn't allowed.
>
> On Wed, Jun 11, 2025, at 3:48 PM, Jeremiah Jordan wrote:
>
> I still maintain that trying to publish an exhaustive list of acceptable tools does not seem reasonable. But I agree that giving people guidance is possible. Maybe having a statement in the contribution guidelines along the lines of:
>
> The list doesn't need to be exhaustive. We are not required to accept AI generated code at all!
>
> We can make a best effort to just vet the ones that people actually want to widely use and refuse everything else, and be better off than allowing people to use tools that are known not to be license compatible or that make little/no effort to avoid reproducing large amounts of copyrighted code.
>
> “Make sure your tools do X, here are some that at the time of being added to this list did X: Tool A, Tool B … Here is a list of tools that at the time of being added to this list did not satisfy X: Tool Z - reason why”
>
> I would be fine with this as an outcome. If we voted with multiple options it wouldn't be my first choice.
>
> This thread only has 4 participants so far, so it's hard to get a signal on what people would want if we tried to vote.
>
> David, Scott, anyone else: if the options were
>
> 1. Allow list
> 2. Basic guidance as suggested by Jeremiah, but primarily leave it up to contributor/reviewer
> 3. Do nothing
> 4. My choice isn't here
>
> what would you want?
>
> My vote in choice order is 1, 2, 3.
>
> Ariel
>
> On Wed, Jun 11, 2025, at 3:48 PM, Jeremiah Jordan wrote:
>
> I respectfully mean that contributors, reviewers, and committers can't feasibly understand and enforce the ASF guidelines.
>
> If this is true, then the ASF is in a lot of trouble and you should bring it up with the ASF board.
>
> Where are you getting this from? From the OpenAI terms of use: https://openai.com/policies/terms-of-use/
>
> We don't even necessarily need to be that restrictive beyond requiring tools that make at least some effort not to reproduce large amounts of copyrighted code that may/may not be license compatible, or tools that are themselves not license compatible. This ends up encompassing most of the ones people want to use anyways.
>
> There is a non-zero amount we can do to educate and guide that would be better than pointing people to the ASF guidelines and leaving it at that.
>
> I still maintain that trying to publish an exhaustive list of acceptable tools does not seem reasonable. But I agree that giving people guidance is possible. Maybe having a statement in the contribution guidelines along the lines of: “Make sure your tools do X, here are some that at the time of being added to this list did X: Tool A, Tool B … Here is a list of tools that at the time of being added to this list did not satisfy X: Tool Z - reason why”
>
> -Jeremiah
>
> On Jun 11, 2025 at 11:48:30 AM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>
> Hi,
>
> I am not saying you said it, but I respectfully mean that contributors, reviewers, and committers can't feasibly understand and enforce the ASF guidelines. We would be another link in a chain of people abdicating responsibility, starting with LLM vendors serving up models that reproduce copyrighted code, then going to ASF legal which gives us guidelines without the tools to enforce those guidelines, and now we (the PMC) would be doing the same to contributors, reviewers, and committers.
>
> I don’t think anyone is going to be able to maintain and enforce a list of acceptable tools for contributors to the project to stick to. We can’t know what someone did on their laptop, all we can do is evaluate the code they submit.
>
> I agree we might not be able to do a perfect job at any aspect of trying to make sure that the code we accept is not problematic in some way, but that doesn't mean we shouldn't try?
>
> We don't even necessarily need to be that restrictive beyond requiring tools that make at least some effort not to reproduce large amounts of copyrighted code that may/may not be license compatible, or tools that are themselves not license compatible. This ends up encompassing most of the ones people want to use anyways.
>
> How many people are aware that if you get code from OpenAI directly the license isn't ASL compatible, but that if you get it via Microsoft services that use OpenAI models it's ASL compatible? It's not in the ASF guidelines (it was, but they removed it!).
>
> How many people are aware that when people use locally run models there is no output filtering, further increasing the odds of the model reproducing copyright-encumbered code?
>
> There is a non-zero amount we can do to educate and guide that would be better than pointing people to the ASF guidelines and leaving it at that.
>
> The ASF guidelines themselves have suggestions like requiring people to say if they used AI and then which AI. I don't think it's very useful beyond checking license compatibility of the AI itself, but that is something we should be doing, so it might as well be documented and included in the PR text.
>
> Ariel
>
> On Mon, Jun 2, 2025, at 7:54 PM, Jeremiah Jordan wrote:
>
> I don’t think I said we should abdicate responsibility? I said the key point is that contributors, and more importantly reviewers and committers, understand the ASF guidelines and hold all code to those standards. Any suspect code should be blocked during review. As Roman says in your quote, this isn’t about AI, it’s about copyright. If someone submits copyrighted code to the project, whether an AI generated it or they just grabbed it from a Google search, it’s on the project to try not to accept it.
>
> I don’t think anyone is going to be able to maintain and enforce a list of acceptable tools for contributors to the project to stick to. We can’t know what someone did on their laptop, all we can do is evaluate the code they submit.
>
> -Jeremiah
>
> On Mon, Jun 2, 2025 at 6:39 PM Ariel Weisberg <ar...@weisberg.ws> wrote:
>
> Hi,
>
> As PMC members/committers we aren't supposed to abdicate this to legal or to contributors. Despite the fact that we aren't equipped to solve this problem, we are supposed to be making sure that code contributed is non-infringing.
>
> This is a quotation from Roman Shaposhnik from this legal thread: https://lists.apache.org/thread/f6k93xx67pc33o0yhm24j3dpq0323gyd
>
> Yes, because you have to. Again -- forget about AI -- if a drive-by contributor submits a patch that has huge amounts of code stolen from some existing copyright holder -- it is very much ON YOU as a committer/PMC to prevent that from happening.
>
> We aren't supposed to knowingly allow people to use AI tools that are known to generate infringing contributions or contributions which are not license compatible (such as OpenAI terms of use).
>
> Ariel
>
> On Mon, Jun 2, 2025, at 7:08 PM, Jeremiah Jordan wrote:
>
> Ultimately it's the contributor's (and committer's) job to ensure that their contributions meet the bar for acceptance
>
> To me this is the key point. Given how pervasive this stuff is becoming, I don’t think it’s feasible to make some list of tools and enforce it. Even without getting into extra tools, IDEs (including IntelliJ) are doing more and more LLM based code suggestion as time goes on. I think we should point people to the ASF Guidelines around such tools, and the guidelines around copyrighted code, and then continue to review patches with the high standards we have always had in this project.
>
> -Jeremiah
>
> On Mon, Jun 2, 2025 at 5:52 PM Ariel Weisberg <ar...@weisberg.ws> wrote:
>
> Hi,
>
> To clarify, are you saying that we should not accept AI generated code until it has been looked at by a human and then written again with different "wording" to ensure that it doesn't directly copy anything?
>
> Or do you mean something else about the quality of "vibe coding" and how we shouldn't allow it because it makes bad code? Ultimately it's the contributor's (and committer's) job to ensure that their contributions meet the bar for acceptance, and I don't think we should tell them how to go about meeting that bar beyond what is needed to address the copyright concern.
>
> I agree that the bar set by the Apache guidelines is pretty high. They are simultaneously impossible and trivial to meet depending on how you interpret them, and we are not very well equipped to interpret them.
>
> It would have been more straightforward for them to simply say no, but they didn't opt to do that, as if there is some way for PMCs to acceptably take AI generated contributions.
>
> Ariel
>
> On Mon, Jun 2, 2025, at 5:03 PM, David Capwell wrote:
>
> fine tuning encourage not reproducing things verbatim
>
> I think not producing copyrighted output from your training data is a technically feasible achievement for these vendors so I have a moderate level of trust they will succeed at it if they say they do it.
>
> Some team members and I discussed this in the context of my documentation patch (which utilized Claude during composition). I conducted an experiment to pose high-level Cassandra-related questions to a model without additional context, while adjusting the temperature parameter (tested at 0.2, 0.5, and 0.8). The results revealed that each test generated content copied verbatim from a specific non-Apache (and non-DSE) website. I did not verify whether this content was copyrighted, though it was easily identifiable through a simple Google search. This occurred as a single sentence within the generated document, and as I am not a legal expert, I cannot determine whether this constitutes a significant issue.
>
> The complexity increases when considering models trained on different languages, which may translate content into English. In such cases, a Google search would fail to detect the origin. Is this still considered plagiarism? Does it violate copyright laws? I am uncertain.
>
> Similar challenges arise with code generation. For instance, if a model is trained on a GPL-licensed Python library that implements a novel data structure, and the model subsequently rewrites this structure in Java, a Google search is unlikely to identify the source.
>
> Personally, I do not assume these models will avoid producing copyrighted material. This doesn’t mean I am against AI at all, but rather reflects my belief that the requirements set by Apache are not easily “provable” in such scenarios.
>
> My personal opinion is that we should at least consider allow listing a few specific sources (any vendor that scans output for infringement) and add that to the PR template and in other locations (readme, web site). Bonus points if we can set up code scanning (useful for non-AI contributions!).
>
> My perspective, after trying to see what AI can do, is the following:
>
> Strengths
> * Generating a preliminary draft of a document and assisting with iterative revisions
> * Documenting individual methods
> * Generation of “simple” methods and scripts, provided the underlying libraries are well-documented in public repositories
> * Managing repetitive or procedural tasks, such as “migrating from X to Y” or “converting serializations to the X interface”
>
> Limitations
> * Producing a fully functional document in a single attempt that meets merge standards. When documenting Gens.java and Property.java, the output appeared plausible but contained frequent inaccuracies.
> * Addressing complex or ambiguous scenarios (“gossip”), though this challenge is not unique to AI. Matt Byrd and I tested Claude for CASSANDRA-20659, where it could identify relevant code but proposed solutions that risked corrupting production clusters.
> * Interpreting large-scale codebases. Beyond approximately 300 lines of actual code (excluding formatting), performance degrades significantly, leading to a marked decline in output quality.
>
> Note: When referring to AI/LLMs, I am not discussing interactions with a user interface to execute specific tasks, but rather leveraging code agents like Roo and Aider to provide contextual information to the LLM.
>
> Given these observations, it remains challenging to determine optimal practices. In some contexts it is very clear that nothing was taken from external work (e.g., “create a test using our BTree class that inserts a row with a null column,” “analyze this function’s purpose”). However, for substantial tasks, the situation becomes more complex. If the author employed AI as a collaborative tool during “pair programming,” the concerns are not really that different from Google searches (unless the work involves unique elements like introducing new data structures or indexes). Conversely, if the author “vibe coded” the entire patch, two primary concerns arise: does the author have rights to the code, and does its quality align with requirements.
>
> TL;DR - I am not against AI contributions, but strongly prefer it's done as “pair programming”. My experience with “vibe coding” makes me worry about the quality of the code, and that the author is less likely to validate that the code generated is safe to donate.
>
> This email was generated with the help of AI =)
>
> On May 30, 2025, at 3:00 PM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>
> Hi all,
>
> It looks like we haven't discussed this much and haven't settled on a policy for what kinds of AI generated contributions we accept and what vetting is required for them.
>
> https://www.apache.org/legal/generative-tooling.html#:~:text=Given%20the%20above,code%20scanning%20results
>
> ```
> Given the above, code generated in whole or in part using AI can be contributed if the contributor ensures that:
>
> 1. The terms and conditions of the generative AI tool do not place any restrictions on use of the output that would be inconsistent with the Open Source Definition.
> 2. At least one of the following conditions is met:
>    2.1 The output is not copyrightable subject matter (and would not be even if produced by a human).
>    2.2 No third party materials are included in the output.
>    2.3 Any third party materials that are included in the output are being used with permission (e.g., under a compatible open-source license) of the third party copyright holders and in compliance with the applicable license terms.
> 3. A contributor obtains reasonable certainty that conditions 2.2 or 2.3 are met if the AI tool itself provides sufficient information about output that may be similar to training data, or from code scanning results.
> ```
>
> There is a lot to unpack there, but it seems like any one of the conditions under 2 needs to be met, and 3 describes how 2.2 and 2.3 can be satisfied.
>
> 2.1 is tricky as we are not copyright lawyers, and 2.2 and 2.3 set a pretty high bar in that it's hard to know if you have met them. Do we have anyone in the community running any code scanning tools already?
>
> Here is the JIRA for addition of the generative AI policy: https://issues.apache.org/jira/browse/LEGAL-631
> Legal mailing list discussion of the policy: https://lists.apache.org/thread/vw3jf4726yrhovg39mcz1y89mx8j4t8s
> Legal mailing list discussion of compliant tools: https://lists.apache.org/thread/nzyl311q53xhpq99grf6l1h076lgzybr
> Legal mailing list discussion about how OpenAI terms are not Apache compatible: https://lists.apache.org/thread/lcvxnpf39v22lc3f9t5fo07p19237d16
> Hadoop mailing list message hinting that they accept contributions but ask which tool: https://lists.apache.org/thread/bgs8x1f9ovrjmhg6b450bz8bt7o43yxj
> Spark mailing list message where they have given up on stopping people: https://lists.apache.org/thread/h6621sxfxcnnpsoyr31x65z207kk80fr
>
> I didn't see other projects discussing and deciding how to handle these contributions, but I also didn't check that many of them, only Hadoop, Spark, Druid, and Pulsar. I also can't see their PMC mailing lists.
>
> I asked O3 to deep research what is done to avoid producing copyrighted code: https://chatgpt.com/share/683a2983-dd9c-8009-9a66-425012af840d
>
> To summarize: training data is deduplicated so the model is less likely to reproduce it verbatim, prompts and fine tuning encourage not reproducing things verbatim, inference is biased to not pick the best option but some neighboring one, encouraging originality, and in some instances the output is checked to make sure it doesn't match the training data. So to some extent 2.2 is being done to different degrees depending on what product you are using.
>
> It's worth noting that scanning the output can be probabilistic in the case of, say, Anthropic, and they still recommend code scanning.
>
> Quite notably, Anthropic indemnifies its enterprise users against copyright claims. It's not perfect, but it does mean they have an incentive to make sure there are fewer copyright claims. We could choose to be picky and only accept specific sources of LLM generated code based on perceived safety.
>
> I think not producing copyrighted output from your training data is a technically feasible achievement for these vendors, so I have a moderate level of trust they will succeed at it if they say they do it.
>
> I could send a message to the legal list asking for clarification and a set of tools, but based on Roman's communication (https://lists.apache.org/thread/f6k93xx67pc33o0yhm24j3dpq0323gyd) I think this is kind of what we get. It's on us to ensure the contributions are kosher, either by code scanning or by accepting that the LLM vendors are doing a good job at avoiding copyrighted output.
>
> My personal opinion is that we should at least consider allow listing a few specific sources (any vendor that scans output for infringement) and add that to the PR template and in other locations (readme, web site). Bonus points if we can set up code scanning (useful for non-AI contributions!).
>
> Regards,
> Ariel
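
For anyone who wants to repeat the temperature experiment David described earlier in the thread, a minimal sketch is below. It uses the Anthropic Python SDK; the model id and the prompt are placeholders I picked, not what he actually ran, and pasting the printed sentences into a search engine is obviously a crude check that won't catch translated or paraphrased training data.

```
# Rough sketch: ask the model a high-level Cassandra question with no extra context
# at several temperatures, then print the longer sentences so they can be pasted
# into a web search to look for verbatim matches.
import re
import anthropic  # pip install anthropic; needs ANTHROPIC_API_KEY in the environment

client = anthropic.Anthropic()
PROMPT = "Explain how Apache Cassandra handles tombstones and compaction."  # placeholder question

for temperature in (0.2, 0.5, 0.8):
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed model id; substitute whatever you want to test
        max_tokens=1024,
        temperature=temperature,
        messages=[{"role": "user", "content": PROMPT}],
    )
    text = response.content[0].text
    # Keep sentences long enough to be distinctive when searched for verbatim.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if len(s.strip()) > 80]
    print(f"--- temperature={temperature} ---")
    for sentence in sentences:
        print(sentence)
```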