To do that, the Cassandra PMC can open a legal JIRA and ask for a (durable, concrete) opinion.
On Fri, Sep 22, 2023 at 5:59 AM Benedict <bened...@apache.org> wrote:
>
> > 1. my understanding is that with the former the liability rests on the
> > provider of the lib to ensure it's in compliance with their claims to
> > copyright
>
> I highly doubt liability works like that in all jurisdictions, even if it
> might in some. I can even think of some historic cases related to Linux
> where patent trolls went after users of Linux, though I’m not sure where
> that got to and I don’t remember all the details.
>
> But anyway, none of us are lawyers and we shouldn’t be depending on this
> kind of analysis. At minimum we should invite legal to proffer an opinion
> on whether dependencies are a valid loophole to the policy.
>
> On 22 Sep 2023, at 13:48, J. D. Jordan <jeremiah.jor...@gmail.com> wrote:
> >
> > This Gen AI generated code use thread should probably be its own mailing
> > list DISCUSS thread? It applies to all source code we take in, and
> > accept copyright assignment of, not to jars we depend on and not only to
> > vector related code contributions.
> >
> > On Sep 22, 2023, at 7:29 AM, Josh McKenzie <jmcken...@apache.org> wrote:
> > >
> > > So if we're going to chat about GenAI on this thread here, 2 things:
> > >
> > > 1. A dependency we pull in != a code contribution (I am not a lawyer
> > > but my understanding is that with the former the liability rests on
> > > the provider of the lib to ensure it's in compliance with their claims
> > > to copyright and it's not sticky). Easier to transition to a different
> > > dep if there's something API compatible or similar.
> > > 2. With code contributions we take in, we take on some exposure in
> > > terms of copyright and infringement. git revert can be painful.
> > >
> > > For this thread, here's an excerpt from the ASF policy:
> > >
> > > a recommended practice when using generative AI tooling is to use
> > > tools with features that identify any included content that is similar
> > > to parts of the tool’s training data, as well as the license of that
> > > content.
> > > Given the above, code generated in whole or in part using AI can be
> > > contributed if the contributor ensures that:
> > >
> > > 1. The terms and conditions of the generative AI tool do not place any
> > >    restrictions on use of the output that would be inconsistent with
> > >    the Open Source Definition (e.g., ChatGPT’s terms are inconsistent).
> > > 2. At least one of the following conditions is met:
> > >    1. The output is not copyrightable subject matter (and would not be
> > >       even if produced by a human)
> > >    2. No third party materials are included in the output
> > >    3. Any third party materials that are included in the output are
> > >       being used with permission (e.g., under a compatible open source
> > >       license) of the third party copyright holders and in compliance
> > >       with the applicable license terms
> > > 3. The contributor obtains reasonable certainty that conditions 2.2 or
> > >    2.3 are met if the AI tool itself provides sufficient information
> > >    about materials that may have been copied, or from code scanning
> > >    results
> > >    1. E.g. AWS CodeWhisperer recently added a feature that provides
> > >       notice and attribution
> > >
> > > When providing contributions authored using generative AI tooling, a
> > > recommended practice is for contributors to indicate the tooling used
> > > to create the contribution. This should be included as a token in the
> > > source control commit message, for example including the phrase
> > > “Generated-by”.
> > >
> > > I think the real challenge right now is ensuring that the output from
> > > an LLM doesn't include a string of tokens that's identical to something
> > > in its input training dataset if it's trained on non-permissively
> > > licensed inputs. That plus the risk of, at least in the US, the courts
> > > landing on the side of saying that not only is the output of generative
> > > AI not copyrightable, but that there's legal liability on either the
> > > users of the tools or the creators of the models for some kind of
> > > copyright infringement.
> > > That can be sticky; if we take PRs that end up with that liability
> > > exposure, we end up in a place where either the foundation could be
> > > legally exposed and/or we'd need to revert some pretty invasive code /
> > > changes.
> > >
> > > For example, Microsoft and OpenAI have publicly committed to paying
> > > legal fees for people sued for copyright infringement for using their
> > > tools:
> > > https://www.verdict.co.uk/microsoft-to-pay-legal-fees-for-customers-sued-while-using-its-ai-products/?cf-view
> > > Pretty interesting, and not a step a provider would take in an
> > > environment where things were legally clear and settled.
> > >
> > > So while the usage of these things is apparently incredibly pervasive
> > > right now, "everybody is doing it" is a pretty high-risk legal
> > > defense. :)
> > >
> > > On Fri, Sep 22, 2023, at 8:04 AM, Mick Semb Wever wrote:
> > > >
> > > > On Thu, 21 Sept 2023 at 10:41, Benedict <bened...@apache.org> wrote:
> > > > >
> > > > > At some point we have to discuss this, and here’s as good a place
> > > > > as any. There’s a great news article published talking about how
> > > > > generative AI was used to assist in developing the new vector
> > > > > search feature, which is itself really cool. Unfortunately it
> > > > > *sounds* like it runs afoul of the ASF legal policy on use for
> > > > > contributions to the project. This proposal is to include a
> > > > > dependency, but I’m not sure if that avoids the issue, and I’m
> > > > > equally uncertain how much this issue is isolated to the dependency
> > > > > (or affects it at all?)
> > > > >
> > > > > Anyway, this is an annoying discussion we need to have at some
> > > > > point, so raising it here now so we can figure it out.
> > > > > [1] https://thenewstack.io/how-ai-helped-us-add-vector-search-to-cassandra-in-6-weeks/
> > > > > [2] https://www.apache.org/legal/generative-tooling.html
> > > >
> > > > My reading of the ASF's GenAI policy is that any generated work in
> > > > the jvector library (and CEP-30?) is not copyrightable, and that
> > > > makes it ok for us to include.
> > > >
> > > > If there were a trace to copyrighted work, or the tooling imposed a
> > > > copyright or restrictions, we would then have to take that into
> > > > consideration.
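[Editor's note: the ASF policy excerpt quoted earlier in the thread recommends recording the GenAI tooling used as a "Generated-by" token in the commit message. A minimal sketch of one way to do that, as a git commit trailer — the repository, commit subject, and tool name below are hypothetical examples, not from the thread:]

```shell
# Sketch: record GenAI tooling as a "Generated-by" commit trailer,
# per the ASF Generative Tooling Guidance quoted above.
# (Throwaway repo; subject and tool name are illustrative only.)
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git -c user.email=dev@example.com -c user.name=Dev \
    commit -q --allow-empty \
    -m "Add vector similarity helper" \
    -m "Generated-by: GitHub Copilot"
# Trailers in the message body are machine-readable via
# `git interpret-trailers` or `git log --format='%(trailers)'`.
git log -1 --format=%b
```

Because the trailer lives in the commit body, reviewers and later audits can grep history for "Generated-by:" to find AI-assisted changes.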