To do that, the Cassandra PMC can open a legal JIRA and ask for a (durable, concrete) opinion.
On Fri, Sep 22, 2023 at 5:59 AM Benedict <bened...@apache.org> wrote:
>
> > 1. my understanding is that with the former the liability rests on the
> > provider of the lib to ensure it's in compliance with their claims to
> > copyright
>
> I highly doubt liability works like that in all jurisdictions, even if it
> might in some. I can even think of some historic cases related to Linux
> where patent trolls went after users of Linux, though I’m not sure where
> that got to and I don’t remember all the details.
>
> But anyway, none of us are lawyers and we shouldn’t be depending on this
> kind of analysis. At minimum we should invite legal to proffer an opinion
> on whether dependencies are a valid loophole to the policy.
>
> On 22 Sep 2023, at 13:48, J. D. Jordan <jeremiah.jor...@gmail.com> wrote:
> >
> > This Gen AI generated code use thread should probably be its own mailing
> > list DISCUSS thread? It applies to all source code we take in, and
> > accept copyright assignment of, not to jars we depend on and not only to
> > vector related code contributions.
> >
> > On Sep 22, 2023, at 7:29 AM, Josh McKenzie <jmcken...@apache.org> wrote:
> > >
> > > So if we're going to chat about GenAI on this thread here, 2 things:
> > >
> > > 1. A dependency we pull in != a code contribution (I am not a lawyer
> > > but my understanding is that with the former the liability rests on
> > > the provider of the lib to ensure it's in compliance with their claims
> > > to copyright and it's not sticky). Easier to transition to a different
> > > dep if there's something API compatible or similar.
> > > 2. With code contributions we take in, we take on some exposure in
> > > terms of copyright and infringement. git revert can be painful.
> > >
> > > For this thread, here's an excerpt from the ASF policy:
> > >
> > > a recommended practice when using generative AI tooling is to use
> > > tools with features that identify any included content that is similar
> > > to parts of the tool’s training data, as well as the license of that
> > > content.
> > > Given the above, code generated in whole or in part using AI can be
> > > contributed if the contributor ensures that:
> > >
> > > 1. The terms and conditions of the generative AI tool do not place any
> > >    restrictions on use of the output that would be inconsistent with
> > >    the Open Source Definition (e.g., ChatGPT’s terms are inconsistent).
> > > 2. At least one of the following conditions is met:
> > >    1. The output is not copyrightable subject matter (and would not be
> > >       even if produced by a human)
> > >    2. No third party materials are included in the output
> > >    3. Any third party materials that are included in the output are
> > >       being used with permission (e.g., under a compatible open source
> > >       license) of the third party copyright holders and in compliance
> > >       with the applicable license terms
> > > 3. The contributor obtains reasonable certainty that conditions 2.2 or
> > >    2.3 are met if the AI tool itself provides sufficient information
> > >    about materials that may have been copied, or from code scanning
> > >    results
> > >    1. E.g. AWS CodeWhisperer recently added a feature that provides
> > >       notice and attribution
> > >
> > > When providing contributions authored using generative AI tooling, a
> > > recommended practice is for contributors to indicate the tooling used
> > > to create the contribution. This should be included as a token in the
> > > source control commit message, for example including the phrase
> > > “Generated-by”.
> > >
> > > I think the real challenge right now is ensuring that the output from
> > > an LLM doesn't include a string of tokens that's identical to something
> > > in its input training dataset if it's trained on non-permissively
> > > licensed inputs. That plus the risk of, at least in the US, the courts
> > > landing on the side of saying that not only is the output of generative
> > > AI not copyrightable, but that there's legal liability on either the
> > > users of the tools or the creators of the models for some kind of
> > > copyright infringement.
> > > That can be sticky; if we take PRs that end up with that liability
> > > exposure, we end up in a place where either the foundation could be
> > > legally exposed and/or we'd need to revert some pretty invasive code /
> > > changes.
> > >
> > > For example, Microsoft and OpenAI have publicly committed to paying
> > > legal fees for people sued for copyright infringement for using their
> > > tools:
> > > https://www.verdict.co.uk/microsoft-to-pay-legal-fees-for-customers-sued-while-using-its-ai-products/?cf-view
> > > Pretty interesting, and not a step a provider would take in an
> > > environment where things were legally clear and settled.
> > >
> > > So while the usage of these things is apparently incredibly pervasive
> > > right now, "everybody is doing it" is a pretty high-risk legal
> > > defense. :)
> > >
> > > On Fri, Sep 22, 2023, at 8:04 AM, Mick Semb Wever wrote:
> > > >
> > > > On Thu, 21 Sept 2023 at 10:41, Benedict <bened...@apache.org> wrote:
> > > > >
> > > > > At some point we have to discuss this, and here’s as good a place
> > > > > as any. There’s a great news article published talking about how
> > > > > generative AI was used to assist in developing the new vector
> > > > > search feature, which is itself really cool. Unfortunately it
> > > > > *sounds* like it runs afoul of the ASF legal policy on use for
> > > > > contributions to the project. This proposal is to include a
> > > > > dependency, but I’m not sure if that avoids the issue, and I’m
> > > > > equally uncertain how much this issue is isolated to the dependency
> > > > > (or affects it at all?)
> > > > >
> > > > > Anyway, this is an annoying discussion we need to have at some
> > > > > point, so raising it here now so we can figure it out.
> > > > > [1] https://thenewstack.io/how-ai-helped-us-add-vector-search-to-cassandra-in-6-weeks/
> > > > > [2] https://www.apache.org/legal/generative-tooling.html
> > > >
> > > > My reading of the ASF's GenAI policy is that any generated work in
> > > > the jvector library (and CEP-30?) is not copyrightable, and that
> > > > makes it ok for us to include.
> > > >
> > > > If there were a trace to copyrighted work, or the tooling imposed a
> > > > copyright or restrictions, we would then have to take that into
> > > > consideration.
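[Editor's note: the ASF policy excerpt quoted earlier in the thread recommends recording the GenAI tooling used as a "Generated-by" token in the commit message. A minimal sketch of one way to do that, as a git commit trailer — the repository, commit subject, and tool name below are hypothetical examples, not from the thread:]

```shell
# Sketch: record GenAI tooling as a "Generated-by" commit trailer,
# per the ASF Generative Tooling Guidance quoted above.
# (Throwaway repo; subject and tool name are illustrative only.)
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git -c user.email=dev@example.com -c user.name=Dev \
    commit -q --allow-empty \
    -m "Add vector similarity helper" \
    -m "Generated-by: GitHub Copilot"
# Trailers in the message body are machine-readable via
# `git interpret-trailers` or `git log --format='%(trailers)'`.
git log -1 --format=%b
```

Because the trailer lives in the commit body, reviewers and later audits can grep history for "Generated-by:" to find AI-assisted changes.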