Re: [EXTERNAL] Re: [DISCUSS] Add JVector as a dependency for CEP-30

2023-10-26 Thread Mick Semb Wever
Legal (Roman) helped clarify the situation for us, see comment on LEGAL-656.

We have the comments on CASSANDRA-18715 confirming ownership of both the
work being contributed and that belonging in the jvector library.

For our (downstream) users and their companies, these legal guarantees are
important and we are respectful that many folk place trust in the ASF's
projects because of them.  It is appreciated that it's brought to light and
dealt with, and I learnt a few things along the way.




On Wed, 25 Oct 2023 at 00:12, Benedict  wrote:

> [LEGAL-656] Application of Generative AI policy to dependencies - ASF JIRA
> <https://issues.apache.org/jira/browse/LEGAL-656>
>
> Legal’s opinion is that this is not an acceptable workaround to the policy.
>
> On 22 Sep 2023, at 23:51, German Eichberger via dev <
> dev@cassandra.apache.org> wrote:
>
> 
> +1 with taking it to legal
>
> Like anyone else, I enjoy speculating about legal stuff, and I think for
> jars you probably need plausible deniability, aka no paper trail that we
> knowingly... but that horse is out of the barn. So I'm really interested in
> what legal says.
>
> If you can stomach non Java here is an alternate DiskANN implementation: 
> microsoft/DiskANN:
> Graph-structured Indices for Scalable, Fast, Fresh and Filtered Approximate
> Nearest Neighbor Search (github.com)
> <https://github.com/microsoft/DiskANN>
>
> Thanks,
> German
>
> ------
> *From:* Josh McKenzie 
> *Sent:* Friday, September 22, 2023 7:43 AM
> *To:* dev 
> *Subject:* [EXTERNAL] Re: [DISCUSS] Add JVector as a dependency for CEP-30
>
>
> I highly doubt liability works like that in all jurisdictions
>
> That's a fantastic point. When speculating there, I overlooked the fact
> that there are literally dozens of legal jurisdictions in which this
> project is used and the foundation operates.
>
> As a PMC let's take this to legal.
>
> On Fri, Sep 22, 2023, at 9:16 AM, Jeff Jirsa wrote:
>
> To do that, the cassandra PMC can open a legal JIRA and ask for a
> (durable, concrete) opinion.
>
>
> On Fri, Sep 22, 2023 at 5:59 AM Benedict  wrote:
>
>
>
> 1. my understanding is that with the former the liability rests on the
>    provider of the lib to ensure it's in compliance with their claims to
>    copyright
>
> I highly doubt liability works like that in all jurisdictions, even if it
> might in some. I can even think of some historic cases related to Linux
> where patent trolls went after users of Linux, though I’m not sure where
> that got to and I don’t remember all the details.
>
> But anyway, none of us are lawyers and we shouldn’t be depending on this
> kind of analysis. At minimum we should invite legal to proffer an opinion
> on whether dependencies are a valid loophole to the policy.
>
>
>
> On 22 Sep 2023, at 13:48, J. D. Jordan  wrote:
>
> 
>
> This Gen AI generated code use thread should probably be its own mailing
> list DISCUSS thread?  It applies to all source code we take in, and accept
> copyright assignment of, not to jars we depend on and not only to vector
> related code contributions.
>
> On Sep 22, 2023, at 7:29 AM, Josh McKenzie  wrote:
>
> 
> So if we're going to chat about GenAI on this thread here, 2 things:
>
> 1. A dependency we pull in != a code contribution (I am not a lawyer
>    but my understanding is that with the former the liability rests on the
>    provider of the lib to ensure it's in compliance with their claims to
>    copyright and it's not sticky). Easier to transition to a different dep if
>    there's something API compatible or similar.
> 2. With code contributions we take in, we take on some exposure in
>    terms of copyright and infringement. git revert can be painful.
>
> For this thread, here's an excerpt from the ASF policy:
>
> a recommended practice when using generative AI tooling is to use tools
> with features that identify any included content that is similar to parts
> of the tool’s training data, as well as the license of that content.
>
> Given the above, code generated in whole or in part using AI can be
> contributed if the contributor ensures that:
>
> 1. The terms and conditions of the generative AI tool do not place any
>    restrictions on use of the output that would be inconsistent with the Open
>    Source Definition (e.g., ChatGPT’s terms are inconsistent).
> 2. At least one of the following conditions is met:
>    1. The output is not copyrightable subje

Re: [EXTERNAL] Re: [DISCUSS] Add JVector as a dependency for CEP-30

2023-10-24 Thread Benedict
[LEGAL-656] Application of Generative AI policy to dependencies - ASF JIRA (issues.apache.org)

Legal’s opinion is that this is not an acceptable workaround to the policy.

On 22 Sep 2023, at 23:51, German Eichberger via dev wrote:

+1 with taking it to legal




Like anyone else, I enjoy speculating about legal stuff, and I think for jars you probably need plausible deniability, aka no paper trail that we knowingly... but that horse is out of the barn. So I'm really interested in what legal says.





If you can stomach non Java here is an alternate DiskANN implementation: 
microsoft/DiskANN: Graph-structured Indices for Scalable, Fast, Fresh and Filtered Approximate Nearest Neighbor Search (github.com)




Thanks,

German





From: Josh McKenzie 
Sent: Friday, September 22, 2023 7:43 AM
To: dev 
Subject: [EXTERNAL] Re: [DISCUSS] Add JVector as a dependency for CEP-30
 


I highly doubt liability works like that in all jurisdictions

That's a fantastic point. When speculating there, I overlooked the fact that there are literally dozens of legal jurisdictions in which this project is used and the foundation operates.


As a PMC let's take this to legal.


On Fri, Sep 22, 2023, at 9:16 AM, Jeff Jirsa wrote:

To do that, the cassandra PMC can open a legal JIRA and ask for a (durable, concrete) opinion.




On Fri, Sep 22, 2023 at 5:59 AM Benedict <bened...@apache.org> wrote:






my understanding is that with the former the liability rests on the provider of the lib to ensure it's in compliance with their claims to copyright



I highly doubt liability works like that in all jurisdictions, even if it might in some. I can even think of some historic cases related to Linux where patent trolls went after users of Linux, though I’m not sure where that got to and I don’t remember all the details.


But anyway, none of us are lawyers and we shouldn’t be depending on this kind of analysis. At minimum we should invite legal to proffer an opinion on whether dependencies are a valid loophole to the policy.







On 22 Sep 2023, at 13:48, J. D. Jordan <jeremiah.jor...@gmail.com> wrote:





This Gen AI generated code use thread should probably be its own mailing list DISCUSS thread?  It applies to all source code we take in, and accept copyright assignment of, not to jars we depend on and not only to vector related code contributions.



On Sep 22, 2023, at 7:29 AM, Josh McKenzie <jmcken...@apache.org> wrote:



So if we're going to chat about GenAI on this thread here, 2 things:

1. A dependency we pull in != a code contribution (I am not a lawyer but my understanding is that with the former the liability rests on the provider of the lib to ensure it's in compliance with their claims to copyright and it's not sticky). Easier to transition to a different dep if there's something API compatible or similar.
2. With code contributions we take in, we take on some exposure in terms of copyright and infringement. git revert can be painful.

For this thread, here's an excerpt from the ASF policy:


a recommended practice when using generative AI tooling is to use tools with features that identify any included content that is similar to parts of the tool’s training data, as well as the license of that content.

Given the above, code generated in whole or in part using AI can be contributed if the contributor ensures that:


1. The terms and conditions of the generative AI tool do not place any restrictions on use of the output that would be inconsistent with the Open Source Definition (e.g., ChatGPT’s terms are inconsistent).
2. At least one of the following conditions is met:
   1. The output is not copyrightable subject matter (and would not be even if produced by a human)
   2. No third party materials are included in the output
   3. Any third party materials that are included in the output are being used with permission (e.g., under a compatible open source license) of the third party copyright holders and in compliance with the applicable license terms
3. A contributor obtains reasonable certainty that conditions 2.2 or 2.3 are met if the AI tool itself provides sufficient information about materials that may have been copied, or from code scanning results
   1. E.g. AWS CodeWhisperer recently added a feature that provides notice and attribution



When providing contributions authored using generative AI tooling, a recommended practice is for contributors to indicate the tooling used to create the contribution. This should be included as a token in the source control commit message, for example including the phrase “Generated-by



I think the real challenge right now is ensuring that the output from an LLM doesn't include a string of tokens that's identical to something in its input training dataset if it's trained on non-permissively licensed inputs. That plus the risk of, at least in the US, the courts landing on the side of saying that not only is the output of generative AI not copyrightable, but that there's legal l

Re: [EXTERNAL] Re: [DISCUSS] Add JVector as a dependency for CEP-30

2023-09-22 Thread German Eichberger via dev
+1 with taking it to legal

Like anyone else, I enjoy speculating about legal stuff, and I think for jars you 
probably need plausible deniability, aka no paper trail that we knowingly... but 
that horse is out of the barn. So I'm really interested in what legal says.

If you can stomach non Java here is an alternate DiskANN implementation: 
microsoft/DiskANN: Graph-structured Indices for Scalable, Fast, Fresh and 
Filtered Approximate Nearest Neighbor Search 
(github.com)<https://github.com/microsoft/DiskANN>

Thanks,
German


From: Josh McKenzie 
Sent: Friday, September 22, 2023 7:43 AM
To: dev 
Subject: [EXTERNAL] Re: [DISCUSS] Add JVector as a dependency for CEP-30

I highly doubt liability works like that in all jurisdictions
That's a fantastic point. When speculating there, I overlooked the fact that 
there are literally dozens of legal jurisdictions in which this project is used 
and the foundation operates.

As a PMC let's take this to legal.

On Fri, Sep 22, 2023, at 9:16 AM, Jeff Jirsa wrote:
To do that, the cassandra PMC can open a legal JIRA and ask for a (durable, 
concrete) opinion.


On Fri, Sep 22, 2023 at 5:59 AM Benedict <bened...@apache.org> wrote:


  1.  my understanding is that with the former the liability rests on the 
provider of the lib to ensure it's in compliance with their claims to copyright

I highly doubt liability works like that in all jurisdictions, even if it might 
in some. I can even think of some historic cases related to Linux where patent 
trolls went after users of Linux, though I’m not sure where that got to and I 
don’t remember all the details.

But anyway, none of us are lawyers and we shouldn’t be depending on this kind 
of analysis. At minimum we should invite legal to proffer an opinion on whether 
dependencies are a valid loophole to the policy.



On 22 Sep 2023, at 13:48, J. D. Jordan <jeremiah.jor...@gmail.com> wrote:


This Gen AI generated code use thread should probably be its own mailing list 
DISCUSS thread?  It applies to all source code we take in, and accept copyright 
assignment of, not to jars we depend on and not only to vector related code 
contributions.

On Sep 22, 2023, at 7:29 AM, Josh McKenzie <jmcken...@apache.org> wrote:

So if we're going to chat about GenAI on this thread here, 2 things:

  1.  A dependency we pull in != a code contribution (I am not a lawyer but my 
understanding is that with the former the liability rests on the provider of 
the lib to ensure it's in compliance with their claims to copyright and it's 
not sticky). Easier to transition to a different dep if there's something API 
compatible or similar.
  2.  With code contributions we take in, we take on some exposure in terms of 
copyright and infringement. git revert can be painful.

For this thread, here's an excerpt from the ASF policy:

a recommended practice when using generative AI tooling is to use tools with 
features that identify any included content that is similar to parts of the 
tool’s training data, as well as the license of that content.

Given the above, code generated in whole or in part using AI can be contributed 
if the contributor ensures that:

  1.  The terms and conditions of the generative AI tool do not place any 
restrictions on use of the output that would be inconsistent with the Open 
Source Definition (e.g., ChatGPT’s terms are inconsistent).
  2.  At least one of the following conditions is met:
 *   The output is not copyrightable subject matter (and would not be even 
if produced by a human)
 *   No third party materials are included in the output
 *   Any third party materials that are included in the output are being 
used with permission (e.g., under a compatible open source license) of the 
third party copyright holders and in compliance with the applicable license 
terms
  3.  A contributor obtains reasonable certainty that conditions 2.2 or 2.3 are 
met if the AI tool itself provides sufficient information about materials that 
may have been copied, or from code scanning results
 *   E.g. AWS CodeWhisperer recently added a feature that provides notice 
and attribution

When providing contributions authored using generative AI tooling, a 
recommended practice is for contributors to indicate the tooling used to create 
the contribution. This should be included as a token in the source control 
commit message, for example including the phrase “Generated-by

I think the real challenge right now is ensuring that the output from an LLM 
doesn't include a string of tokens that's identical to something in its input 
training dataset if it's trained on non-permissively licensed inputs. That plus 
the risk of, at least in the US, the courts landing on the side of saying that 
not only is the output of generative AI not copyrightable, but that there's 
legal liability on either the users of the tools or the creators of the models 
for some kind o

Re: [DISCUSS] Add JVector as a dependency for CEP-30

2023-09-22 Thread Mike Adamson
> For my understanding, isn’t it gonna be an issue to be copyrighted also
> to a single person? For the same reasons?

This was partly why I asked. I did a random check of libraries that are
definite dependencies (netty, guava) and both contain author copyrights.

On Fri, 22 Sept 2023, 16:01 Ekaterina Dimitrova wrote:

> For my understanding, isn’t it gonna be an issue to be copyrighted also to
> a single person? For the same reasons?
>
> On Fri, 22 Sep 2023 at 7:59, Mick Semb Wever  wrote:
>
>>
>>
>>> Just for my understanding on this. Is the issue that the code has a
>>> copyright header on it or that it is copyright to a corporate entity?
>>>
>>
>>
>> The potential issue here is about dependence upon one vendor (or
>> commercial actor).
>> If the project is not usable without a specific piece of work (library)
>> that is controlled and maintained elsewhere, and exercising our freedom to
>> rewrite/fork is difficult, the project isn't really independent.  Being
>> independent is an important tenet for ASF projects.
>>
>> I don't see this being an issue with jamm or jvector.  But I do think
>> it's important to check.
>>
>>


Re: [DISCUSS] Add JVector as a dependency for CEP-30

2023-09-22 Thread Ekaterina Dimitrova
For my understanding, isn’t it gonna be an issue to be copyrighted also to
a single person? For the same reasons?

On Fri, 22 Sep 2023 at 7:59, Mick Semb Wever  wrote:

>
>
>> Just for my understanding on this. Is the issue that the code has a
>> copyright header on it or that it is copyright to a corporate entity?
>>
>
>
> The potential issue here is about dependence upon one vendor (or
> commercial actor).
> If the project is not usable without a specific piece of work (library)
> that is controlled and maintained elsewhere, and exercising our freedom to
> rewrite/fork is difficult, the project isn't really independent.  Being
> independent is an important tenet for ASF projects.
>
> I don't see this being an issue with jamm or jvector.  But I do think it's
> important to check.
>
>


Re: [DISCUSS] Add JVector as a dependency for CEP-30

2023-09-22 Thread Josh McKenzie
> I highly doubt liability works like that in all jurisdictions
That's a fantastic point. When speculating there, I overlooked the fact that 
there are literally dozens of legal jurisdictions in which this project is used 
and the foundation operates.

As a PMC let's take this to legal.

On Fri, Sep 22, 2023, at 9:16 AM, Jeff Jirsa wrote:
> To do that, the cassandra PMC can open a legal JIRA and ask for a (durable, 
> concrete) opinion.
> 
> 
> On Fri, Sep 22, 2023 at 5:59 AM Benedict  wrote:
>> 
>>> 1. my understanding is that with the former the liability rests on the 
>>> provider of the lib to ensure it's in compliance with their claims to 
>>> copyright
>> I highly doubt liability works like that in all jurisdictions, even if it 
>> might in some. I can even think of some historic cases related to Linux 
>> where patent trolls went after users of Linux, though I’m not sure where 
>> that got to and I don’t remember all the details.
>> 
>> But anyway, none of us are lawyers and we shouldn’t be depending on this 
>> kind of analysis. At minimum we should invite legal to proffer an opinion on 
>> whether dependencies are a valid loophole to the policy.
>> 
>> 
>> 
>>> On 22 Sep 2023, at 13:48, J. D. Jordan  wrote:
>>> 
>>> 
>>> This Gen AI generated code use thread should probably be its own mailing 
>>> list DISCUSS thread?  It applies to all source code we take in, and accept 
>>> copyright assignment of, not to jars we depend on and not only to vector 
>>> related code contributions.
>>> 
>>>> On Sep 22, 2023, at 7:29 AM, Josh McKenzie wrote:
>>>>
>>>> So if we're going to chat about GenAI on this thread here, 2 things:
>>>> 1. A dependency we pull in != a code contribution (I am not a lawyer but
>>>>    my understanding is that with the former the liability rests on the
>>>>    provider of the lib to ensure it's in compliance with their claims to
>>>>    copyright and it's not sticky). Easier to transition to a different dep if
>>>>    there's something API compatible or similar.
>>>> 2. With code contributions we take in, we take on some exposure in terms
>>>>    of copyright and infringement. git revert can be painful.
>>>> For this thread, here's an excerpt from the ASF policy:
>>>>> a recommended practice when using generative AI tooling is to use tools
>>>>> with features that identify any included content that is similar to parts
>>>>> of the tool’s training data, as well as the license of that content.
>>>>>
>>>>> Given the above, code generated in whole or in part using AI can be
>>>>> contributed if the contributor ensures that:
>>>>>
>>>>> 1. The terms and conditions of the generative AI tool do not place any
>>>>>    restrictions on use of the output that would be inconsistent with the
>>>>>    Open Source Definition (e.g., ChatGPT’s terms are inconsistent).
>>>>> 2. At least one of the following conditions is met:
>>>>>    1. The output is not copyrightable subject matter (and would not be
>>>>>       even if produced by a human)
>>>>>    2. No third party materials are included in the output
>>>>>    3. Any third party materials that are included in the output are being
>>>>>       used with permission (e.g., under a compatible open source license) of
>>>>>       the third party copyright holders and in compliance with the applicable
>>>>>       license terms
>>>>> 3. A contributor obtains reasonable certainty that conditions 2.2 or 2.3
>>>>>    are met if the AI tool itself provides sufficient information about
>>>>>    materials that may have been copied, or from code scanning results
>>>>>    1. E.g. AWS CodeWhisperer recently added a feature that provides
>>>>>       notice and attribution
>>>>> When providing contributions authored using generative AI tooling, a
>>>>> recommended practice is for contributors to indicate the tooling used to
>>>>> create the contribution. This should be included as a token in the source
>>>>> control commit message, for example including the phrase “Generated-by
>>>>>
>>>>
>>>> I think the real challenge right now is ensuring that the output from an
>>>> LLM doesn't include a string of tokens that's identical to something in
>>>> its input training dataset if it's trained on non-permissively licensed
>>>> inputs. That plus the risk of, at least in the US, the courts landing on
>>>> the side of saying that not only is the output of generative AI not
>>>> copyrightable, but that there's legal liability on either the users of the
>>>> tools or the creators of the models for some kind of copyright
>>>> infringement. That can be sticky; if we take PR's that end up with that
>>>> liability exposure, we end up in a place where either the foundation could
>>>> be legally exposed and/or we'd need to revert some pretty invasive code /
>>>> changes.
>>>>
>>>> For example, Microsoft and OpenAI have publicly committed to paying legal
>>>> fees for people sued for copyright infringement for using their tools:

Re: [DISCUSS] Add JVector as a dependency for CEP-30

2023-09-22 Thread Jeff Jirsa
To do that, the cassandra PMC can open a legal JIRA and ask for a (durable,
concrete) opinion.


On Fri, Sep 22, 2023 at 5:59 AM Benedict  wrote:

>
> 1. my understanding is that with the former the liability rests on the
>    provider of the lib to ensure it's in compliance with their claims to
>    copyright
>
> I highly doubt liability works like that in all jurisdictions, even if it
> might in some. I can even think of some historic cases related to Linux
> where patent trolls went after users of Linux, though I’m not sure where
> that got to and I don’t remember all the details.
>
> But anyway, none of us are lawyers and we shouldn’t be depending on this
> kind of analysis. At minimum we should invite legal to proffer an opinion
> on whether dependencies are a valid loophole to the policy.
>
>
> On 22 Sep 2023, at 13:48, J. D. Jordan  wrote:
>
> 
> This Gen AI generated code use thread should probably be its own mailing
> list DISCUSS thread?  It applies to all source code we take in, and accept
> copyright assignment of, not to jars we depend on and not only to vector
> related code contributions.
>
> On Sep 22, 2023, at 7:29 AM, Josh McKenzie  wrote:
>
> 
> So if we're going to chat about GenAI on this thread here, 2 things:
>
> 1. A dependency we pull in != a code contribution (I am not a lawyer
>    but my understanding is that with the former the liability rests on the
>    provider of the lib to ensure it's in compliance with their claims to
>    copyright and it's not sticky). Easier to transition to a different dep if
>    there's something API compatible or similar.
> 2. With code contributions we take in, we take on some exposure in
>    terms of copyright and infringement. git revert can be painful.
>
> For this thread, here's an excerpt from the ASF policy:
>
> a recommended practice when using generative AI tooling is to use tools
> with features that identify any included content that is similar to parts
> of the tool’s training data, as well as the license of that content.
>
> Given the above, code generated in whole or in part using AI can be
> contributed if the contributor ensures that:
>
> 1. The terms and conditions of the generative AI tool do not place any
>    restrictions on use of the output that would be inconsistent with the Open
>    Source Definition (e.g., ChatGPT’s terms are inconsistent).
> 2. At least one of the following conditions is met:
>    1. The output is not copyrightable subject matter (and would not be
>       even if produced by a human)
>    2. No third party materials are included in the output
>    3. Any third party materials that are included in the output are
>       being used with permission (e.g., under a compatible open source
>       license) of the third party copyright holders and in compliance with
>       the applicable license terms
> 3. A contributor obtains reasonable certainty that conditions 2.2 or
>    2.3 are met if the AI tool itself provides sufficient information about
>    materials that may have been copied, or from code scanning results
>    1. E.g. AWS CodeWhisperer recently added a feature that provides
>       notice and attribution
>
> When providing contributions authored using generative AI tooling, a
> recommended practice is for contributors to indicate the tooling used to
> create the contribution. This should be included as a token in the source
> control commit message, for example including the phrase “Generated-by
>
>
> I think the real challenge right now is ensuring that the output from an
> LLM doesn't include a string of tokens that's identical to something in its
> input training dataset if it's trained on non-permissively licensed inputs.
> That plus the risk of, at least in the US, the courts landing on the side
> of saying that not only is the output of generative AI not copyrightable,
> but that there's legal liability on either the users of the tools or the
> creators of the models for some kind of copyright infringement. That can be
> sticky; if we take PR's that end up with that liability exposure, we end up
> in a place where either the foundation could be legally exposed and/or we'd
> need to revert some pretty invasive code / changes.
>
> For example, Microsoft and OpenAI have publicly committed to paying legal
> fees for people sued for copyright infringement for using their tools:
> https://www.verdict.co.uk/microsoft-to-pay-legal-fees-for-customers-sued-while-using-its-ai-products/?cf-view
> .
> Pretty interesting, and not a step a provider would take in an environment
> where things were legally clear and settled.
>
> So while the usage of these things is apparently incredibly pervasive
> right now, "everybody is doing it" is a 

Re: [DISCUSS] Add JVector as a dependency for CEP-30

2023-09-22 Thread Benedict
my understanding is that with the former the liability rests on the provider of the lib to ensure it's in compliance with their claims to copyright

I highly doubt liability works like that in all jurisdictions, even if it might in some. I can even think of some historic cases related to Linux where patent trolls went after users of Linux, though I’m not sure where that got to and I don’t remember all the details.

But anyway, none of us are lawyers and we shouldn’t be depending on this kind of analysis. At minimum we should invite legal to proffer an opinion on whether dependencies are a valid loophole to the policy.

On 22 Sep 2023, at 13:48, J. D. Jordan wrote:

This Gen AI generated code use thread should probably be its own mailing list DISCUSS thread?  It applies to all source code we take in, and accept copyright assignment of, not to jars we depend on and not only to vector related code contributions.

On Sep 22, 2023, at 7:29 AM, Josh McKenzie wrote:

So if we're going to chat about GenAI on this thread here, 2 things:

1. A dependency we pull in != a code contribution (I am not a lawyer but my understanding is that with the former the liability rests on the provider of the lib to ensure it's in compliance with their claims to copyright and it's not sticky). Easier to transition to a different dep if there's something API compatible or similar.
2. With code contributions we take in, we take on some exposure in terms of copyright and infringement. git revert can be painful.

For this thread, here's an excerpt from the ASF policy:

a recommended practice when using generative AI tooling is to use tools with features that identify any included content that is similar to parts of the tool’s training data, as well as the license of that content.

Given the above, code generated in whole or in part using AI can be contributed if the contributor ensures that:

1. The terms and conditions of the generative AI tool do not place any restrictions on use of the output that would be inconsistent with the Open Source Definition (e.g., ChatGPT’s terms are inconsistent).
2. At least one of the following conditions is met:
   1. The output is not copyrightable subject matter (and would not be even if produced by a human)
   2. No third party materials are included in the output
   3. Any third party materials that are included in the output are being used with permission (e.g., under a compatible open source license) of the third party copyright holders and in compliance with the applicable license terms
3. A contributor obtains reasonable certainty that conditions 2.2 or 2.3 are met if the AI tool itself provides sufficient information about materials that may have been copied, or from code scanning results
   1. E.g. AWS CodeWhisperer recently added a feature that provides notice and attribution

When providing contributions authored using generative AI tooling, a recommended practice is for contributors to indicate the tooling used to create the contribution. This should be included as a token in the source control commit message, for example including the phrase “Generated-by

I think the real challenge right now is ensuring that the output from an LLM doesn't include a string of tokens that's identical to something in its input training dataset if it's trained on non-permissively licensed inputs. That plus the risk of, at least in the US, the courts landing on the side of saying that not only is the output of generative AI not copyrightable, but that there's legal liability on either the users of the tools or the creators of the models for some kind of copyright infringement. That can be sticky; if we take PR's that end up with that liability exposure, we end up in a place where either the foundation could be legally exposed and/or we'd need to revert some pretty invasive code / changes.

For example, Microsoft and OpenAI have publicly committed to paying legal fees for people sued for copyright infringement for using their tools: https://www.verdict.co.uk/microsoft-to-pay-legal-fees-for-customers-sued-while-using-its-ai-products/?cf-view. Pretty interesting, and not a step a provider would take in an environment where things were legally clear and settled.

So while the usage of these things is apparently incredibly pervasive right now, "everybody is doing it" is a pretty high risk legal defense. :)

On Fri, Sep 22, 2023, at 8:04 AM, Mick Semb Wever wrote:

On Thu, 21 Sept 2023 at 10:41, Benedict wrote:

At some point we have to discuss this, and here’s as good a place as any. There’s a great news article published talking about how generative AI was used to assist in developing the new vector search feature, which is itself really cool. Unfortunately it *sounds* like it runs afoul of the ASF legal policy on use for contributions to the project. This proposal is to include a dependency, but I’m not sure if that avoids the issue, and I’m equally uncertain how much this issue is isolated to the dependency (or affects it at all?)

Anyway, this is an annoying 

Re: [DISCUSS] Add JVector as a dependency for CEP-30

2023-09-22 Thread J. D. Jordan
This Gen AI generated code use thread should probably be its own mailing list
DISCUSS thread?  It applies to all source code we take in, and accept copyright
assignment of, not to jars we depend on and not only to vector related code
contributions.

On Sep 22, 2023, at 7:29 AM, Josh McKenzie wrote:

> So if we're going to chat about GenAI on this thread here, 2 things:
>  1. A dependency we pull in != a code contribution (I am not a lawyer but my
>     understanding is that with the former the liability rests on the provider
>     of the lib to ensure it's in compliance with their claims to copyright and
>     it's not sticky). Easier to transition to a different dep if there's
>     something API compatible or similar.
>  2. With code contributions we take in, we take on some exposure in terms of
>     copyright and infringement. git revert can be painful.
> For this thread, here's an excerpt from the ASF policy:
>> a recommended practice when using generative AI tooling is to use tools with
>> features that identify any included content that is similar to parts of the
>> tool’s training data, as well as the license of that content.
>>
>> Given the above, code generated in whole or in part using AI can be
>> contributed if the contributor ensures that:
>>
>>  1. The terms and conditions of the generative AI tool do not place any
>>     restrictions on use of the output that would be inconsistent with the
>>     Open Source Definition (e.g., ChatGPT’s terms are inconsistent).
>>  2. At least one of the following conditions is met:
>>      1. The output is not copyrightable subject matter (and would not be
>>         even if produced by a human)
>>      2. No third party materials are included in the output
>>      3. Any third party materials that are included in the output are being
>>         used with permission (e.g., under a compatible open source license)
>>         of the third party copyright holders and in compliance with the
>>         applicable license terms
>>  3. A contributor obtain reasonable certainty that conditions 2.2 or 2.3 are
>>     met if the AI tool itself provides sufficient information about
>>     materials that may have been copied, or from code scanning results
>>      1. E.g. AWS CodeWhisperer recently added a feature that provides notice
>>         and attribution
>>
>> When providing contributions authored using generative AI tooling, a
>> recommended practice is for contributors to indicate the tooling used to
>> create the contribution. This should be included as a token in the source
>> control commit message, for example including the phrase “Generated-by
>
> I think the real challenge right now is ensuring that the output from an LLM
> doesn't include a string of tokens that's identical to something in its input
> training dataset if it's trained on non-permissively licensed inputs. That
> plus the risk of, at least in the US, the courts landing on the side of
> saying that not only is the output of generative AI not copyrightable, but
> that there's legal liability on either the users of the tools or the creators
> of the models for some kind of copyright infringement. That can be sticky; if
> we take PR's that end up with that liability exposure, we end up in a place
> where either the foundation could be legally exposed and/or we'd need to
> revert some pretty invasive code / changes.
>
> For example, Microsoft and OpenAI have publicly committed to paying legal
> fees for people sued for copyright infringement for using their tools:
> https://www.verdict.co.uk/microsoft-to-pay-legal-fees-for-customers-sued-while-using-its-ai-products/?cf-view.
> Pretty interesting, and not a step a provider would take in an environment
> where things were legally clear and settled.
>
> So while the usage of these things is apparently incredibly pervasive right
> now, "everybody is doing it" is a pretty high risk legal defense. :)
>
> On Fri, Sep 22, 2023, at 8:04 AM, Mick Semb Wever wrote:
>> On Thu, 21 Sept 2023 at 10:41, Benedict wrote:
>>>
>>> At some point we have to discuss this, and here’s as good a place as any.
>>> There’s a great news article published talking about how generative AI was
>>> used to assist in developing the new vector search feature, which is
>>> itself really cool. Unfortunately it *sounds* like it runs afoul of the
>>> ASF legal policy on use for contributions to the project. This proposal is
>>> to include a dependency, but I’m not sure if that avoids the issue, and
>>> I’m equally uncertain how much this issue is isolated to the dependency
>>> (or affects it at all?)
>>>
>>> Anyway, this is an annoying discussion we need to have at some point, so
>>> raising it here now so we can figure it out.
>>>
>>> [1] https://thenewstack.io/how-ai-helped-us-add-vector-search-to-cassandra-in-6-weeks/
>>> [2] https://www.apache.org/legal/generative-tooling.html
>>
>> My reading of the ASF's GenAI policy is that any generated work in the
>> jvector library (and cep-30 ?) are not copyrightable, and that makes them ok
>> for us to include.
>>
>> If there was a trace to copyrighted work, or the tooling imposed a copyright
>> or restrictions, we would then have to take considerations.

Re: [DISCUSS] Add JVector as a dependency for CEP-30

2023-09-22 Thread Josh McKenzie
So if we're going to chat about GenAI on this thread here, 2 things:
 1. A dependency we pull in != a code contribution (I am not a lawyer but my 
understanding is that with the former the liability rests on the provider of 
the lib to ensure it's in compliance with their claims to copyright and it's 
not sticky). Easier to transition to a different dep if there's something API 
compatible or similar.
 2. With code contributions we take in, we take on some exposure in terms of 
copyright and infringement. git revert can be painful.
For this thread, here's an excerpt from the ASF policy:
> a recommended practice when using generative AI tooling is to use tools with 
> features that identify any included content that is similar to parts of the 
> tool’s training data, as well as the license of that content.
> 
> Given the above, code generated in whole or in part using AI can be 
> contributed if the contributor ensures that:
> 
>  1. The terms and conditions of the generative AI tool do not place any 
> restrictions on use of the output that would be inconsistent with the Open 
> Source Definition (e.g., ChatGPT’s terms are inconsistent).
>  2. At least one of the following conditions is met:
>      1. The output is not copyrightable subject matter (and would not be even 
>         if produced by a human)
>      2. No third party materials are included in the output
>      3. Any third party materials that are included in the output are being 
>         used with permission (e.g., under a compatible open source license) of 
>         the third party copyright holders and in compliance with the applicable 
>         license terms
>  3. A contributor obtain reasonable certainty that conditions 2.2 or 2.3 are 
>     met if the AI tool itself provides sufficient information about materials 
>     that may have been copied, or from code scanning results
>      1. E.g. AWS CodeWhisperer recently added a feature that provides notice 
>         and attribution
> 
> When providing contributions authored using generative AI tooling, a 
> recommended practice is for contributors to indicate the tooling used to 
> create the contribution. This should be included as a token in the source 
> control commit message, for example including the phrase “Generated-by
> 
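
One rough way to see how the quoted conditions fit together is as a boolean check. This is only an illustrative Python sketch of one reading of the excerpt above: the class and field names are invented here and are not ASF terminology, and condition 3 (how a contributor gets reasonable certainty about 2.2 or 2.3) is left out because it describes evidence rather than a gate.

```python
from dataclasses import dataclass


@dataclass
class GeneratedContribution:
    # Hypothetical field names, one per quoted condition.
    tool_terms_compatible: bool      # condition 1
    not_copyrightable: bool          # condition 2.1 (even if a human wrote it)
    no_third_party_material: bool    # condition 2.2
    third_party_permitted: bool      # condition 2.3


def may_be_contributed(c: GeneratedContribution) -> bool:
    # Condition 1 must hold, plus at least one of 2.1, 2.2 or 2.3.
    return c.tool_terms_compatible and (
        c.not_copyrightable
        or c.no_third_party_material
        or c.third_party_permitted
    )
```

Under this reading, a tool whose terms fail condition 1 is ruled out regardless of what the output contains.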

I think the real challenge right now is ensuring that the output from an LLM 
doesn't include a string of tokens that's identical to something in its input 
training dataset if it's trained on non-permissively licensed inputs. That plus 
the risk of, at least in the US, the courts landing on the side of saying that 
not only is the output of generative AI not copyrightable, but that there's 
legal liability on either the users of the tools or the creators of the models 
for some kind of copyright infringement. That can be sticky; if we take PR's 
that end up with that liability exposure, we end up in a place where either the 
foundation could be legally exposed and/or we'd need to revert some pretty 
invasive code / changes.

For example, Microsoft and OpenAI have publicly committed to paying legal fees 
for people sued for copyright infringement for using their tools: 
https://www.verdict.co.uk/microsoft-to-pay-legal-fees-for-customers-sued-while-using-its-ai-products/?cf-view.
 Pretty interesting, and not a step a provider would take in an environment 
where things were legally clear and settled.

So while the usage of these things is apparently incredibly pervasive right 
now, "everybody is doing it" is a pretty high risk legal defense. :)

On Fri, Sep 22, 2023, at 8:04 AM, Mick Semb Wever wrote:
> 
> 
> On Thu, 21 Sept 2023 at 10:41, Benedict  wrote:
>> 
>> At some point we have to discuss this, and here’s as good a place as any. 
>> There’s a great news article published talking about how generative AI was 
>> used to assist in developing the new vector search feature, which is itself 
>> really cool. Unfortunately it *sounds* like it runs afoul of the ASF legal 
>> policy on use for contributions to the project. This proposal is to include 
>> a dependency, but I’m not sure if that avoids the issue, and I’m equally 
>> uncertain how much this issue is isolated to the dependency (or affects it 
>> at all?)
>> 
>> Anyway, this is an annoying discussion we need to have at some point, so 
>> raising it here now so we can figure it out.
>> 
>> [1] 
>> https://thenewstack.io/how-ai-helped-us-add-vector-search-to-cassandra-in-6-weeks/
>>  
>> 
>> [2] https://www.apache.org/legal/generative-tooling.html
>> 
> 
> 
> My reading of the ASF's GenAI policy is that any generated work in the 
> jvector library (and cep-30 ?) are not copyrightable, and that makes them ok 
> for us to include.
> 
> If there was a trace to copyrighted work, or the tooling imposed a copyright 
> or restrictions, we would then have to take considerations.

Re: [DISCUSS] Add JVector as a dependency for CEP-30

2023-09-22 Thread Benedict
My reading is quite different, in fact it is quite explicit that e.g. ChatGPT
is forbidden from use, whereas AWS CodeWhisperer may be permitted depending on
the attribution.

I assume you are reading clause 2.1, but this requires that work "would not be
[copyrightable] even if produced by a human" which is clearly not the case for
most code.

I suspect most generated code is forbidden in practice. Either way, the
portions of any contribution produced by the code assistant must be included in
a separate commit with the tooling used clearly marked in the commit, including
any source attribution. This is likely a challenging task to undertake
retrospectively, and we may need advice on how to proceed unless there is an
audit trail of some kind that can be followed to ensure this is done
accurately - particularly since multiple generative code tools appear to have
been used in the production of this work.

As I said, an annoying topic.

On 22 Sep 2023, at 13:06, Mick Semb Wever wrote:

> On Thu, 21 Sept 2023 at 10:41, Benedict wrote:
>
>> At some point we have to discuss this, and here’s as good a place as any.
>> There’s a great news article published talking about how generative AI was
>> used to assist in developing the new vector search feature, which is itself
>> really cool. Unfortunately it *sounds* like it runs afoul of the ASF legal
>> policy on use for contributions to the project. This proposal is to include
>> a dependency, but I’m not sure if that avoids the issue, and I’m equally
>> uncertain how much this issue is isolated to the dependency (or affects it
>> at all?)
>>
>> Anyway, this is an annoying discussion we need to have at some point, so
>> raising it here now so we can figure it out.
>>
>> [1] https://thenewstack.io/how-ai-helped-us-add-vector-search-to-cassandra-in-6-weeks/
>> [2] https://www.apache.org/legal/generative-tooling.html
>
> My reading of the ASF's GenAI policy is that any generated work in the
> jvector library (and cep-30 ?) are not copyrightable, and that makes them ok
> for us to include.
>
> If there was a trace to copyrighted work, or the tooling imposed a copyright
> or restrictions, we would then have to take considerations.


Re: [DISCUSS] Add JVector as a dependency for CEP-30

2023-09-22 Thread Mick Semb Wever
On Thu, 21 Sept 2023 at 10:41, Benedict  wrote:

> At some point we have to discuss this, and here’s as good a place as any.
> There’s a great news article published talking about how generative AI was
> used to assist in developing the new vector search feature, which is itself
> really cool. Unfortunately it *sounds* like it runs afoul of the ASF legal
> policy on use for contributions to the project. This proposal is to include
> a dependency, but I’m not sure if that avoids the issue, and I’m equally
> uncertain how much this issue is isolated to the dependency (or affects it
> at all?)
>
> Anyway, this is an annoying discussion we need to have at some point, so
> raising it here now so we can figure it out.
>
> [1]
> https://thenewstack.io/how-ai-helped-us-add-vector-search-to-cassandra-in-6-weeks/
> 
> [2] https://www.apache.org/legal/generative-tooling.html
>


My reading of the ASF's GenAI policy is that any generated work in the
jvector library (and cep-30 ?) are not copyrightable, and that makes them
ok for us to include.

If there was a trace to copyrighted work, or the tooling imposed a
copyright or restrictions, we would then have to take considerations.


Re: [DISCUSS] Add JVector as a dependency for CEP-30

2023-09-22 Thread Mick Semb Wever
> Just for my understanding on this. Is the issue that the code has a
> copyright header on it or that it is copyright to a corporate entity?
>


The potential issue here is about dependence upon one vendor (or commercial
actor).
If the project is not usable without a specific piece of work (library)
that is controlled and maintained elsewhere, and exercising our freedom to
rewrite/fork is difficult, the project isn't really independent.  Being
independent is an important tenet for ASF projects.

I don't see this being an issue with jamm or jvector.  But I do think it's
important to check.


Re: [DISCUSS] Add JVector as a dependency for CEP-30

2023-09-22 Thread Mike Adamson
Just for my understanding on this. Is the issue that the code has a
copyright header on it or that it is copyright to a corporate entity?

On Fri, 22 Sept 2023 at 10:11, Mick Semb Wever  wrote:

>> Especially for an optional feature with clear alternative implementations,
>> this doesn't bother me at all. It's well within ASF policy to include
>> permissively licensed code copyrighted by other people or entities.
>>
>
>
> We should be conscious of the problem if this was a crucial (and evolving)
> part of the code that the project was dependent on, even if only the
> optics of it are problematic.
>
> So long as we've asked the question, and this is just an add-on feature that
> the codebase is not dependent on,  and no one has any objections then I'm
> ok with it.
>


-- 
*Mike Adamson*
Engineering

+1 650 389 6000 | datastax.com


Re: [DISCUSS] Add JVector as a dependency for CEP-30

2023-09-22 Thread Mick Semb Wever
>
> Especially for an optional feature with clear alternative implementations,
> this doesn't bother me at all. It's well within ASF policy to include
> permissively licensed code copyrighted by other people or entities.
>


We should be conscious of the problem if this was a crucial (and evolving)
part of the code that the project was dependent on, even if only the optics
of it are problematic.

So long as we've asked the question, and this is just an add-on feature that
the codebase is not dependent on,  and no one has any objections then I'm
ok with it.


Re: [DISCUSS] Add JVector as a dependency for CEP-30

2023-09-21 Thread Josh McKenzie
Oops; thought I'd already +1'ed earlier in the thread.

In case it wasn't clear: +1 on inclusion as-is.

On Thu, Sep 21, 2023, at 4:00 PM, Josh McKenzie wrote:
> My .02 re: the copyright: the library is licensed ASL v2.0. Who it's 
> originally copyrighted by / to (Jonathan personally, DataStax as a corporate 
> entity, Santa Claus, my dog :)) doesn't really have any impact on the 
> legalities of our ability to make use of it or the durability or safety of 
> the code in our ecosystem.
> 
> Especially for an optional feature with clear alternative implementations, 
> this doesn't bother me at all. It's well within ASF policy to include 
> permissively licensed code copyrighted by other people or entities.
> 
> On Thu, Sep 21, 2023, at 1:02 PM, Mick Semb Wever wrote:
>> 
>>> I am confused by your +1 here. You are +1 on including it, but only if the 
>>> copyright were different?  Given DataStax wrote the library I don’t see how 
>>> that will change?
>>  
>> 
>> No blocker on including the library.  I'm hoping we can address concerns in 
>> parallel, I don't want to hold things up.  (They might become a blocker on 
>> the next release, depending on where discussions go, so we should start 'em.)
> 


Re: [DISCUSS] Add JVector as a dependency for CEP-30

2023-09-21 Thread Josh McKenzie
My .02 re: the copyright: the library is licensed ASL v2.0. Who it's originally 
copyrighted by / to (Jonathan personally, DataStax as a corporate entity, Santa 
Claus, my dog :)) doesn't really have any impact on the legalities of our 
ability to make use of it or the durability or safety of the code in our 
ecosystem.

Especially for an optional feature with clear alternative implementations, this 
doesn't bother me at all. It's well within ASF policy to include permissively 
licensed code copyrighted by other people or entities.

On Thu, Sep 21, 2023, at 1:02 PM, Mick Semb Wever wrote:
> 
>> I am confused by your +1 here. You are +1 on including it, but only if the 
>> copyright were different?  Given DataStax wrote the library I don’t see how 
>> that will change?
>  
> 
> No blocker on including the library.  I'm hoping we can address concerns in 
> parallel, I don't want to hold things up.  (They might become a blocker on 
> the next release, depending on where discussions go, so we should start 'em.)


Re: [DISCUSS] Add JVector as a dependency for CEP-30

2023-09-21 Thread Mick Semb Wever
> I am confused by your +1 here. You are +1 on including it, but only if the
> copyright were different?  Given DataStax wrote the library I don’t see how
> that will change?
>


No blocker on including the library.  I'm hoping we can address concerns in
parallel, I don't want to hold things up.  (They might become a blocker on
the next release, depending on where discussions go, so we should start
'em.)


Re: [DISCUSS] Add JVector as a dependency for CEP-30

2023-09-21 Thread J. D. Jordan
Mick,

I am confused by your +1 here. You are +1 on including it, but only if the
copyright were different?  Given DataStax wrote the library I don’t see how
that will change?

On Sep 21, 2023, at 3:05 AM, Mick Semb Wever wrote:

> On Wed, 20 Sept 2023 at 18:31, Mike Adamson wrote:
>
>> The original patch for CEP-30 brought several modified Lucene classes
>> in-tree to implement the concurrent HNSW graph used by the vector index.
>> These classes are now being replaced with the io.github.jbellis.jvector
>> library, which contains an improved diskANN implementation for the on-disk
>> graph format.
>> The repo for this library is here: https://github.com/jbellis/jvector.
>> The library does not replace any code used by SAI or other parts of the
>> codebase and is used solely by the vector index.
>> I would welcome any feedback on this change.
>
> +1
>
> but to nit-pick on legalities… it would be nice to avoid including a
> library copyrighted to DataStax (for historical reasons).
> The Jamm library is in a similar state in that it has a license that refers
> to the copyright owner but does not state the copyright owner anywhere.
>
> Can we get a copyright on Jamm, and can both not be Datastax (pls) ?


Re: [DISCUSS] Add JVector as a dependency for CEP-30

2023-09-21 Thread Benedict
At some point we have to discuss this, and here’s as good a place as any.
There’s a great news article published talking about how generative AI was used
to assist in developing the new vector search feature, which is itself really
cool. Unfortunately it *sounds* like it runs afoul of the ASF legal policy on
use for contributions to the project. This proposal is to include a dependency,
but I’m not sure if that avoids the issue, and I’m equally uncertain how much
this issue is isolated to the dependency (or affects it at all?)

Anyway, this is an annoying discussion we need to have at some point, so
raising it here now so we can figure it out.

[1] https://thenewstack.io/how-ai-helped-us-add-vector-search-to-cassandra-in-6-weeks/
[2] https://www.apache.org/legal/generative-tooling.html

On 21 Sep 2023, at 09:04, Mick Semb Wever wrote:

> On Wed, 20 Sept 2023 at 18:31, Mike Adamson wrote:
>
>> The original patch for CEP-30 brought several modified Lucene classes
>> in-tree to implement the concurrent HNSW graph used by the vector index.
>> These classes are now being replaced with the io.github.jbellis.jvector
>> library, which contains an improved diskANN implementation for the on-disk
>> graph format.
>> The repo for this library is here: https://github.com/jbellis/jvector.
>> The library does not replace any code used by SAI or other parts of the
>> codebase and is used solely by the vector index.
>> I would welcome any feedback on this change.
>
> +1
>
> but to nit-pick on legalities… it would be nice to avoid including a
> library copyrighted to DataStax (for historical reasons).
> The Jamm library is in a similar state in that it has a license that refers
> to the copyright owner but does not state the copyright owner anywhere.
>
> Can we get a copyright on Jamm, and can both not be Datastax (pls) ?


Re: [DISCUSS] Add JVector as a dependency for CEP-30

2023-09-21 Thread Mick Semb Wever
On Wed, 20 Sept 2023 at 18:31, Mike Adamson  wrote:

> The original patch for CEP-30 brought several modified Lucene classes
> in-tree to implement the concurrent HNSW graph used by the vector index.
> These classes are now being replaced with the io.github.jbellis.jvector
> library, which contains an improved diskANN implementation for the on-disk
> graph format.
> The repo for this library is here: https://github.com/jbellis/jvector.
> The library does not replace any code used by SAI or other parts of the
> codebase and is used solely by the vector index.
> I would welcome any feedback on this change.
>


+1

but to nit-pick on legalities… it would be nice to avoid including a
library copyrighted to DataStax (for historical reasons).
The Jamm library is in a similar state in that it has a license that refers
to the copyright owner but does not state the copyright owner anywhere.

Can we get a copyright on Jamm, and can both not be Datastax (pls) ?


Re: [DISCUSS] Add JVector as a dependency for CEP-30

2023-09-20 Thread J. D. Jordan
+1 for jvector rather than forked lucene classes.

On Sep 20, 2023, at 5:14 PM, German Eichberger via dev wrote:

> +1
>
> I am biased because DiskANN is from Microsoft Research but it's a good
> library/algorithm
>
> From: Mike Adamson
> Sent: Wednesday, September 20, 2023 8:58 AM
> To: dev
> Subject: [EXTERNAL] [DISCUSS] Add JVector as a dependency for CEP-30
>
> The original patch for CEP-30 brought several modified Lucene classes
> in-tree to implement the concurrent HNSW graph used by the vector index.
>
> These classes are now being replaced with the io.github.jbellis.jvector
> library, which contains an improved diskANN implementation for the on-disk
> graph format.
>
> The repo for this library is here: https://github.com/jbellis/jvector.
>
> The library does not replace any code used by SAI or other parts of the
> codebase and is used solely by the vector index.
>
> I would welcome any feedback on this change.
> --
> Mike Adamson
> Engineering
>
> +1 650 389 6000 | datastax.com


Re: [DISCUSS] Add JVector as a dependency for CEP-30

2023-09-20 Thread German Eichberger via dev
+1

I am biased because DiskANN is from Microsoft Research but it's  a good 
library/algorithm

From: Mike Adamson 
Sent: Wednesday, September 20, 2023 8:58 AM
To: dev 
Subject: [EXTERNAL] [DISCUSS] Add JVector as a dependency for CEP-30

The original patch for CEP-30 brought several modified Lucene classes in-tree 
to implement the concurrent HNSW graph used by the vector index.

These classes are now being replaced with the io.github.jbellis.jvector 
library, which contains an improved diskANN implementation for the on-disk 
graph format.

The repo for this library is here: https://github.com/jbellis/jvector.

The library does not replace any code used by SAI or other parts of the 
codebase and is used solely by the vector index.

I would welcome any feedback on this change.
--
Mike Adamson
Engineering

+1 650 389 6000 | datastax.com



[DISCUSS] Add JVector as a dependency for CEP-30

2023-09-20 Thread Mike Adamson
The original patch for CEP-30 brought several modified Lucene classes
in-tree to implement the concurrent HNSW graph used by the vector index.

These classes are now being replaced with the io.github.jbellis.jvector
library, which contains an improved diskANN implementation for the on-disk
graph format.

The repo for this library is here: https://github.com/jbellis/jvector.

The library does not replace any code used by SAI or other parts of the
codebase and is used solely by the vector index.

I would welcome any feedback on this change.
-- 
*Mike Adamson*
Engineering

+1 650 389 6000 | datastax.com