I haven't participated much here, but my vote would be basically #1, i.e.
an "allow list" with a clear procedure for expansion.

On Mon, Jun 16, 2025 at 4:05 PM Ariel Weisberg <ar...@weisberg.ws> wrote:

> Hi,
>
> We could, but if the allow list is binding then it's still an allow list
> with some guidance on how to expand the allow list.
>
> If it isn't binding then it's guidance so still option 2 really.
>
> I think the key distinction to find some early consensus on is whether we
> do a binding allow list or just guidance; after that we can iron out the
> guidance itself, which I think will be less controversial to work out.
>
> Or option 3 which is not accepting AI generated contributions. I think
> there are some with healthy skepticism of AI generated code, but so far I
> haven't met anyone who wants to forbid it entirely.
>
> Ariel
>
> On Mon, Jun 16, 2025, at 4:54 PM, Josh McKenzie wrote:
>
> Couldn't our official stance be a combination of 1 and 2? i.e. "Here's an
> allow list. If you're using something not on that allow list, here's some
> basic guidance and maybe let us know how you tried to mitigate some of this
> risk so we can update our allow list w/some nuance".
>
> On Mon, Jun 16, 2025, at 4:39 PM, Ariel Weisberg wrote:
>
> Hi,
>
> On Wed, Jun 11, 2025, at 3:48 PM, Jeremiah Jordan wrote:
>
> Where are you getting this from?  From the OpenAI terms of use:
> https://openai.com/policies/terms-of-use/
>
> Direct from the ASF legal mailing list discussion I linked to in my
> original email calling this out
> https://lists.apache.org/thread/lcvxnpf39v22lc3f9t5fo07p19237d16 It's not
> clear from that thread precisely what they are objecting to and whether it
> has changed (another challenge!), but I believe it's restrictions on what
> you are allowed to do with the output of OpenAI models. And if you get the
> output via other services, it's under a different license and it's fine!
>
> Already we are demonstrating that it is not trivial to understand what is
> and isn't allowed.
>
> On Wed, Jun 11, 2025, at 3:48 PM, Jeremiah Jordan wrote:
>
> I still maintain that trying to publish an exhaustive list of acceptable
> tools does not seem reasonable.
> But I agree that giving people guidance is possible.  Maybe having a
> statement in the contribution guidelines along the lines of:
>
> The list doesn't need to be exhaustive. We are not required to accept AI
> generated code at all!
>
> We can make a best effort to vet just the ones that people actually want
> to use widely, refuse everything else, and still be better off than
> allowing people to use tools that are known not to be license compatible
> or that make little/no effort to avoid reproducing large amounts of
> copyrighted code.
>
> “Make sure your tools do X, here are some that at the time of being added
> to this list did X, Tool A, Tool B …
> Here is a list of tools that at the time of being added to this list did
> not satisfy X. Tool Z - reason why”
>
> I would be fine with this as an outcome. If we voted with multiple options
> it wouldn't be my first choice.
>
> This thread only has 4 participants so far so it's hard to get a signal on
> what people would want if we tried to vote.
>
> David, Scott, anyone else if the options were:
>
>    1. Allow list
>    2. Basic guidance as suggested by Jeremiah, but primarily leave it up
>    to contributor/reviewer
>    3. Do nothing
>    4. My choice isn't here
>
> What would you want?
>
> My vote in choice order is 1,2,3.
>
> Ariel
>
>
> On Wed, Jun 11, 2025, at 3:48 PM, Jeremiah Jordan wrote:
>
>
>  I respectfully mean that contributors, reviewers, and committers can't
> feasibly understand and enforce the ASF guidelines.
>
> If this is true, then the ASF is in a lot of trouble and you should bring
> it up with the ASF board.
> Where are you getting this from?  From the OpenAI terms of use:
> https://openai.com/policies/terms-of-use/
>
> We don't even necessarily need to be that restrictive beyond requiring
> tools that make at least some effort not to reproduce large amounts of
> copyrighted code (which may or may not be license compatible) and
> excluding tools that are themselves not license compatible. This ends up
> encompassing most of the ones people want to use anyway.
>
>
> There is a non-zero amount we can do to educate and guide that would be
> better than pointing people to the ASF guidelines and leaving it at that.
>
>
> I still maintain that trying to publish an exhaustive list of acceptable
> tools does not seem reasonable.
> But I agree that giving people guidance is possible.  Maybe having a
> statement in the contribution guidelines along the lines of:
> “Make sure your tools do X, here are some that at the time of being added
> to this list did X, Tool A, Tool B …
> Here is a list of tools that at the time of being added to this list did
> not satisfy X. Tool Z - reason why”
>
> -Jeremiah
>
>
> On Jun 11, 2025 at 11:48:30 AM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>
>
> Hi,
>
> I am not saying you said it, but I respectfully mean that contributors,
> reviewers, and committers can't feasibly understand and enforce the ASF
> guidelines. We would be another link in a chain of people abdicating
> responsibility starting with LLM vendors serving up models that reproduce
> copyrighted code, then going to ASF legal which gives us guidelines without
> the tools to enforce those guidelines, and now we (the PMC) would be doing
> the same to contributors, reviewers, and committers.
>
> I don’t think anyone is going to be able to maintain and enforce a list of
> acceptable tools for contributors to the project to stick to. We can’t know
> what someone did on their laptop, all we can do is evaluate the code they
> submit.
>
> I agree we might not be able to do a perfect job at any aspect of trying
> to make sure that the code we accept is not problematic in some way, but
> that doesn't mean we shouldn't try?
>
> We don't even necessarily need to be that restrictive beyond requiring
> tools that make at least some effort not to reproduce large amounts of
> copyrighted code (which may or may not be license compatible) and
> excluding tools that are themselves not license compatible. This ends up
> encompassing most of the ones people want to use anyway.
>
> How many people are aware that if you get code from OpenAI directly the
> license isn't ASL compatible, but that if you get it via Microsoft
> services that use OpenAI models it is ASL compatible? It's not in the
> ASF guidelines (it was, but they removed it!).
>
> How many people are aware that when people use locally run models there is
> no output filtering, further increasing the odds of the model reproducing
> copyright-encumbered code?
>
> There is a non-zero amount we can do to educate and guide that would be
> better than pointing people to the ASF guidelines and leaving it at that.
>
> The ASF guidelines themselves have suggestions like requiring people to
> say if they used AI and, if so, which AI. I don't think that's very useful
> beyond checking license compatibility of the AI itself, but that is
> something we should be doing, so it might as well be documented and
> included in the PR text.
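>
> As a rough sketch (the wording and checklist items here are entirely
> hypothetical, just to make the idea concrete), the PR template section
> could be as simple as:
>
> ```
> ## Generative AI disclosure
> - [ ] No generative AI tools were used for this contribution
> - [ ] Generative AI tools were used: <tool(s) and how they were used>
>   - [ ] The tool's terms allow contributing its output under ASLv2
>   - [ ] I checked the output for copied third party code (and/or ran
>         code scanning)
> ```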
>
> Ariel
>
> On Mon, Jun 2, 2025, at 7:54 PM, Jeremiah Jordan wrote:
>
> I don’t think I said we should abdicate responsibility?  I said the key
> point is that contributors, and more importantly reviewers and committers
> understand the ASF guidelines and hold all code to those standards. Any
> suspect code should be blocked during review. As Roman says in your quote,
> this isn’t about AI, it’s about copyright. If someone submits copyrighted
> code to the project, whether an AI generated it or they just grabbed it
> from a Google search, it’s on the project to try not to accept it.
> I don’t think anyone is going to be able to maintain and enforce a list of
> acceptable tools for contributors to the project to stick to. We can’t know
> what someone did on their laptop, all we can do is evaluate the code they
> submit.
>
> -Jeremiah
>
> On Mon, Jun 2, 2025 at 6:39 PM Ariel Weisberg <ar...@weisberg.ws> wrote:
>
>
> Hi,
>
> As PMC members/committers we aren't supposed to abdicate this to legal or
> to contributors. Despite the fact that we aren't equipped to solve this
> problem we are supposed to be making sure that code contributed is
> non-infringing.
>
> This is a quotation from Roman Shaposhnik from this legal thread
> https://lists.apache.org/thread/f6k93xx67pc33o0yhm24j3dpq0323gyd
>
> Yes, because you have to. Again -- forget about AI -- if a drive-by
> contributor submits a patch that has huge amounts of code stolen from some
> existing copyright holder -- it is very much ON YOU as a committer/PMC to
> prevent that from happening.
>
>
> We aren't supposed to knowingly allow people to use AI tools that are
> known to generate infringing contributions or contributions which are not
> license compatible (such as OpenAI terms of use).
>
> Ariel
> On Mon, Jun 2, 2025, at 7:08 PM, Jeremiah Jordan wrote:
>
> > Ultimately it's the contributor's (and committer's) job to ensure that
> their contributions meet the bar for acceptance
> To me this is the key point. Given how pervasive this stuff is becoming, I
> don’t think it’s feasible to make some list of tools and enforce it.  Even
> without getting into extra tools, IDEs (including IntelliJ) are doing more
> and more LLM based code suggestion as time goes on.
> I think we should point people to the ASF Guidelines around such tools,
> and the guidelines around copyrighted code, and then continue to review
> patches with the high standards we have always had in this project.
>
> -Jeremiah
>
> On Mon, Jun 2, 2025 at 5:52 PM Ariel Weisberg <ar...@weisberg.ws> wrote:
>
>
> Hi,
>
> To clarify are you saying that we should not accept AI generated code
> until it has been looked at by a human and then written again with
> different "wording" to ensure that it doesn't directly copy anything?
>
> Or do you mean something else about the quality of "vibe coding" and how
> we shouldn't allow it because it makes bad code? Ultimately it's the
> contributor's (and committer's) job to ensure that their contributions meet
> the bar for acceptance and I don't think we should tell them how to go
> about meeting that bar beyond what is needed to address the copyright
> concern.
>
> I agree that the bar set by the Apache guidelines is pretty high. The
> guidelines are simultaneously impossible and trivial to meet depending on
> how you interpret them, and we are not very well equipped to interpret
> them.
>
> It would have been more straightforward for them to simply say no, but
> they opted not to, which implies there is some way for PMCs to acceptably
> take AI generated contributions.
>
> Ariel
>
> On Mon, Jun 2, 2025, at 5:03 PM, David Capwell wrote:
>
> fine tuning encourage not reproducing things verbatim
>
> I think not producing copyrighted output from your training data is a
> technically feasible achievement for these vendors so I have a moderate
> level of trust they will succeed at it if they say they do it.
>
>
> Some team members and I discussed this in the context of my documentation
> patch (which utilized Claude during composition). I conducted an experiment
> to pose high-level Cassandra-related questions to a model without
> additional context, while adjusting the temperature parameter (tested at
> 0.2, 0.5, and 0.8). The results revealed that each test generated content
> copied verbatim from a specific non-Apache (and non-DSE) website. I did not
> verify whether this content was copyrighted, though it was easily
> identifiable through a simple Google search. This occurred as a single
> sentence within the generated document, and as I am not a legal expert, I
> cannot determine whether this constitutes a significant issue.
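>
> For reference, a minimal sketch of that kind of probe (the `generate`
> helper below is just a placeholder for whatever model client or agent you
> are testing, not the exact setup I ran; the exact-phrase search is how I
> checked for verbatim reuse):
>
> ```python
> import re
> import urllib.parse
>
> def generate(prompt: str, temperature: float) -> str:
>     """Placeholder: call whatever model/agent you are evaluating."""
>     raise NotImplementedError("plug in your model client here")
>
> PROMPT = "Explain at a high level how Cassandra replicates data."
>
> for temperature in (0.2, 0.5, 0.8):
>     answer = generate(PROMPT, temperature)
>     # Split into sentences and emit exact-phrase ("quoted") search queries
>     # so a human can check each one for verbatim matches on the web.
>     for sentence in re.split(r"(?<=[.!?])\s+", answer):
>         sentence = sentence.strip()
>         if len(sentence.split()) < 8:
>             continue  # short fragments match everything; skip them
>         query = urllib.parse.quote_plus(f'"{sentence}"')
>         print(f"temp={temperature}: https://www.google.com/search?q={query}")
> ```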
>
> The complexity increases when considering models trained on different
> languages, which may translate content into English. In such cases, a
> Google search would fail to detect the origin. Is this still considered
> plagiarism? Does it violate copyright laws? I am uncertain.
>
> Similar challenges arise with code generation. For instance, if a model is
> trained on a GPL-licensed Python library that implements a novel data
> structure, and the model subsequently rewrites this structure in Java, a
> Google search is unlikely to identify the source.
>
> Personally, I do not assume these models will avoid producing copyrighted
> material. This doesn’t mean I am against AI at all, but rather reflects
> my belief that the requirements set by Apache are not easily “provable” in
> such scenarios.
>
>
> My personal opinion is that we should at least consider allow listing a
> few specific sources (any vendor that scans output for infringement) and
> add that to the PR template and in other locations (readme, web site).
> Bonus points if we can set up code scanning (useful for non-AI
> contributions!).
>
>
> My perspective, after trying to see what AI can do, is the following:
>
> Strengths
> * Generating a preliminary draft of a document and assisting with
> iterative revisions
> * Documenting individual methods
> * Generation of “simple” methods and scripts, provided the underlying
> libraries are well-documented in public repositories
> * Managing repetitive or procedural tasks, such as “migrating from X to Y”
> or “converting serializations to the X interface”
>
> Limitations
> * Producing a fully functional document in a single attempt that meets
> merge standards. When documenting Gens.java and Property.java, the output
> appeared plausible but contained frequent inaccuracies.
> * Addressing complex or ambiguous scenarios (“gossip”), though this
> challenge is not unique to AI—Matt Byrd and I tested Claude for
> CASSANDRA-20659, where it could identify relevant code but proposed
> solutions that risked corrupting production clusters.
> * Interpreting large-scale codebases. Beyond approximately 300 lines of
> actual code (excluding formatting), performance degrades significantly,
> leading to a marked decline in output quality.
>
> Note: When referring to AI/LLMs, I am not discussing interactions with a
> user interface to execute specific tasks, but rather leveraging code agents
> like Roo and Aider to provide contextual information to the LLM.
>
> Given these observations, it remains challenging to determine optimal
> practices. In some contexts it is easy to tell that nothing was taken
> from external work (e.g., “create a test using our BTree class
> that inserts a row with a null column,” “analyze this function’s purpose”).
> However, for substantial tasks, the situation becomes more complex. If the
> author employed AI as a collaborative tool during “pair programming,”
> concerns are not really that different from Google searches (unless the
> work involves unique elements like introducing new data structures or
> indexes). Conversely, if the author “vibe coded” the entire patch, two
> primary concerns arise: whether the author has the rights to the code and
> whether its quality aligns with requirements.
>
>
> TL;DR - I am not against AI contributions, but strongly prefer it be done as
> “pair programming”.  My experience with “vibe coding” makes me worry about
> the quality of the code, and that the author is less likely to validate
> that the code generated is safe to donate.
>
> This email was generated with the help of AI =)
>
>
> On May 30, 2025, at 3:00 PM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>
> Hi all,
>
> It looks like we haven't discussed this much and haven't settled on a
> policy for what kinds of AI generated contributions we accept and what
> vetting is required for them.
>
>
> https://www.apache.org/legal/generative-tooling.html#:~:text=Given%20the%20above,code%20scanning%20results
>
> ```
> Given the above, code generated in whole or in part using AI can be
> contributed if the contributor ensures that:
>
> 1. The terms and conditions of the generative AI tool do not place any
> restrictions on use of the output that would be inconsistent with the Open
> Source Definition.
> 2. At least one of the following conditions is met:
>    2.1 The output is not copyrightable subject matter (and would not be
> even if produced by a human).
>    2.2 No third party materials are included in the output.
>    2.3 Any third party materials that are included in the output are being
> used with permission (e.g., under a compatible open-source license) of the
> third party copyright holders and in compliance with the applicable license
> terms.
> 3. A contributor obtains reasonable certainty that conditions 2.2 or 2.3
> are met if the AI tool itself provides sufficient information about output
> that may be similar to training data, or from code scanning results.
> ```
>
> There is a lot to unpack there, but it seems like any one of the conditions
> under 2 needs to be met, and 3 describes how 2.2 and 2.3 can be satisfied.
>
> 2.1 is tricky as we are not copyright lawyers, and 2.2 and 2.3 are a pretty
> high bar in that it's hard to know if you have met them. Do we have anyone
> in the community running any code scanning tools already?
>
> Here is the JIRA for addition of the generative AI policy:
> https://issues.apache.org/jira/browse/LEGAL-631
> Legal mailing list discussion of the policy:
> https://lists.apache.org/thread/vw3jf4726yrhovg39mcz1y89mx8j4t8s
> Legal mailing list discussion of compliant tools:
> https://lists.apache.org/thread/nzyl311q53xhpq99grf6l1h076lgzybr
> Legal mailing list discussion about how Open AI terms are not Apache
> compatible:
> https://lists.apache.org/thread/lcvxnpf39v22lc3f9t5fo07p19237d16
> Hadoop mailing list message hinting that they accept contributions but ask
> which tool:
> https://lists.apache.org/thread/bgs8x1f9ovrjmhg6b450bz8bt7o43yxj
> Spark mailing list message where they have given up on stopping people:
> https://lists.apache.org/thread/h6621sxfxcnnpsoyr31x65z207kk80fr
>
> I didn't see other projects discussing and deciding how to handle these
> contributions, but I also didn't check that many of them, only Hadoop,
> Spark, Druid, and Pulsar. I also can't see their PMC mailing lists.
>
> I asked O3 to deep research what is done to avoid producing copyrighted
> code: https://chatgpt.com/share/683a2983-dd9c-8009-9a66-425012af840d
>
> To summarize: training data is deduplicated so the model is less likely to
> reproduce it verbatim, prompts and fine tuning encourage not reproducing
> things verbatim, the inference is biased to not pick the best option but
> some neighboring one (encouraging originality), and in some instances the
> output is checked to make sure it doesn't match the training data. So to
> some extent 2.2 is being done, to different degrees, depending on what
> product you are using.
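>
> As a toy illustration of "biased to not pick the best option but some
> neighboring one" (this is just the generic softmax-with-temperature
> sampling mechanism, not any particular vendor's implementation):
>
> ```python
> import math
> import random
>
> def sample(logits, temperature):
>     # temperature -> 0 degenerates to greedy decoding: always the top token,
>     # i.e. the most likely (and most "memorized") continuation.
>     if temperature <= 0:
>         return max(range(len(logits)), key=lambda i: logits[i])
>     scaled = [l / temperature for l in logits]
>     m = max(scaled)  # subtract the max for numerical stability
>     weights = [math.exp(s - m) for s in scaled]
>     return random.choices(range(len(logits)), weights=weights)[0]
>
> logits = [4.0, 3.5, 1.0]  # token 0 is the single "best" continuation
> for t in (0.0, 0.5, 1.5):
>     picks = [sample(logits, t) for _ in range(1000)]
>     print(t, [picks.count(i) / 1000 for i in range(3)])
> ```
>
> Higher temperatures spread the choices across neighboring tokens, which is
> part of why verbatim reproduction is less likely, though clearly not
> impossible.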
>
> It's worth noting that scanning the output can be probabilistic, as in the
> case of Anthropic, and they still recommend code scanning.
>
> Quite notably, Anthropic indemnifies its enterprise users against copyright
> claims. It's not perfect, but it does mean they have an incentive
> to make sure there are fewer copyright claims. We could choose to be picky
> and only accept specific sources of LLM generated code based on perceived
> safety.
>
> I think not producing copyrighted output from your training data is a
> technically feasible achievement for these vendors so I have a moderate
> level of trust they will succeed at it if they say they do it.
>
> I could send a message to the legal list asking for clarification and a
> set of tools, but based on Roman's communication (
> https://lists.apache.org/thread/f6k93xx67pc33o0yhm24j3dpq0323gyd) I think
> this is kind of what we get. It's on us to ensure the contributions are
> kosher either by code scanning or accepting that the LLM vendors are doing
> a good job at avoiding copyrighted output.
>
> My personal opinion is that we should at least consider allow listing a
> few specific sources (any vendor that scans output for infringement) and
> add that to the PR template and in other locations (readme, web site).
> Bonus points if we can set up code scanning (useful for non-AI
> contributions!).
>
> Regards,
> Ariel
>
