Re: [POLL] Vector type for ML

2023-05-05 Thread Rahul Xavier Singh
Love it. Thank you folks for coming to a decision on this. This is very
helpful for moving forward on planning for the current Python frameworks:

   - Langchain.CassandraVectorStore
   - Langchain.CassandraVectorRetriever
   - Langchain.CassandraVectorStoreAgent
   - LlamaIndex.CassandraVectorLoader
   - LlamaIndex.CassandraVectorIndex


Rahul Singh


On Fri, May 5, 2023 at 7:13 PM David Capwell  wrote:

> [CASSANDRA-18504] Added support for type VECTOR - ASF JIRA
> https://issues.apache.org/jira/browse/CASSANDRA-18504
>

Re: [POLL] Vector type for ML

2023-05-05 Thread David Capwell
https://issues.apache.org/jira/browse/CASSANDRA-18504

> On May 5, 2023, at 12:27 PM, David Capwell  wrote:
> 
> Yep, fair point…. SPARSE VECTOR better maps to NON NULL MAP

Re: [POLL] Vector type for ML

2023-05-05 Thread David Capwell
Yep, fair point…. SPARSE VECTOR better maps to NON NULL MAP

> On May 5, 2023, at 11:58 AM, David Capwell  wrote:
> 
>> If we ever add sparse vectors, we can assume that DENSE is the default and 
>> allow to use either DENSE, SPARSE or nothing.
> 
> I have been feeling that a sparse vector is just a fixed-size list with
> nulls… so an array… if you insert {0: 42, 3: 17}, do you get an array of
> [42, null, null, 17]?  One downside of doing this is that any
> operator/function that needs to reify large vectors (let's say 10k elements)
> uses a ton of memory because we make it an array… so a new type could be
> used to lower this cost…
> 
> With DENSE VECTOR we have the syntax in place that we “could” add SPARSE 
> later… With VECTOR we will have complications adding a sparse vector after 
> the fact due to this implying DENSE…

Re: [POLL] Vector type for ML

2023-05-05 Thread Jonathan Ellis
Sparse vector in ML has the semantics that elements not explicitly set are
zero.  I believe most (all?) sparse vector implementations use a map under
the hood; the point is to save a lot of space when you have 10K zeros and
100 that are nonzero.
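
A minimal Python sketch of the semantics described above, assuming a plain dict as the map under the hood. The names `to_dense` and `sparse_dot` are hypothetical; they are not part of Cassandra or of any ML library, and only illustrate that unset elements read as zero and that operations can skip them.

```python
from typing import Dict, List


def to_dense(sparse: Dict[int, float], dimension: int) -> List[float]:
    """Expand an {index: value} map; positions that were never set are 0.0."""
    dense = [0.0] * dimension
    for index, value in sparse.items():
        if not 0 <= index < dimension:
            raise IndexError(f"index {index} outside dimension {dimension}")
        dense[index] = value
    return dense


def sparse_dot(a: Dict[int, float], b: Dict[int, float]) -> float:
    """Dot product that only visits indices present in both maps."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller map
    return sum(value * b[index] for index, value in a.items() if index in b)
```

With 100 nonzero entries out of roughly 10K, the map holds 100 pairs, while the dense form holds every slot.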

On Fri, May 5, 2023 at 2:00 PM David Capwell  wrote:

>> If we ever add sparse vectors, we can assume that DENSE is the default and
>> allow to use either DENSE, SPARSE or nothing.
>
> I have been feeling that a sparse vector is just a fixed-size list with
> nulls… so an array… if you insert {0: 42, 3: 17}, do you get an array of
> [42, null, null, 17]?  One downside of doing this is that any
> operator/function that needs to reify large vectors (let's say 10k elements)
> uses a ton of memory because we make it an array… so a new type could be
> used to lower this cost…
>
> With DENSE VECTOR we have the syntax in place that we “could” add SPARSE
> later… With VECTOR we will have complications adding a sparse vector after
> the fact due to this implying DENSE…

Re: [POLL] Vector type for ML

2023-05-05 Thread David Capwell
> If we ever add sparse vectors, we can assume that DENSE is the default and 
> allow to use either DENSE, SPARSE or nothing.

I have been feeling that a sparse vector is just a fixed-size list with nulls…
so an array… if you insert {0: 42, 3: 17}, do you get an array of
[42, null, null, 17]?  One downside of doing this is that any operator/function
that needs to reify large vectors (let's say 10k elements) uses a ton of memory
because we make it an array… so a new type could be used to lower this cost…
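
To make the memory concern concrete, here is a hedged Python sketch of the null-filled reading of a sparse vector. `reify_with_nulls` is a hypothetical helper, not Cassandra code, and Python's `None` stands in for null.

```python
from typing import Dict, List, Optional


def reify_with_nulls(entries: Dict[int, int], dimension: int) -> List[Optional[int]]:
    """Materialize every slot of the vector; unset positions become None."""
    dense: List[Optional[int]] = [None] * dimension
    for index, value in entries.items():
        dense[index] = value
    return dense


small = reify_with_nulls({0: 42, 3: 17}, 4)
print(small)  # [42, None, None, 17]

# The same two values in a 10k-element vector still materialize 10,000 slots.
large = reify_with_nulls({0: 42, 3: 17}, 10_000)
print(len(large), sum(v is not None for v in large))
```

The cost comes from the number of slots, not the number of values, which is why a separate sparse type could be cheaper.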

With DENSE VECTOR we have the syntax in place that we “could” add SPARSE later… 
With VECTOR we will have complications adding a sparse vector after the fact 
due to this implying DENSE…

Updated ranking

Syntax                 Score
VECTOR                 21
DENSE VECTOR           12
type[dimension]        10
NON NULL [dimension]   8
VECTOR type[n]         5
DENSE_VECTOR           4
NON-NULL FROZEN        3
ARRAY                  1

Syntax                 Round 1   Round 2
VECTOR                 4         4
DENSE VECTOR           2         3
NON NULL [dimension]   2         1
VECTOR type[n]         1
type[dimension]        1
DENSE_VECTOR           1
NON-NULL FROZEN        1
ARRAY                  0


VECTOR is still in the lead…

> On May 5, 2023, at 11:40 AM, Andrés de la Peña  wrote:
> 
> My vote is:
> 
> 1. VECTOR
> 2. DENSE VECTOR
> 3. type[dimension]
> 
> If we ever add sparse vectors, we can assume that DENSE is the default and 
> allow to use either DENSE, SPARSE or nothing.
> 
> Perhaps the dimension could be separated from the type, such as in 
> VECTOR[dimension] or VECTOR(dimension).
Re: [POLL] Vector type for ML

2023-05-05 Thread Andrés de la Peña
My vote is:

1. VECTOR
2. DENSE VECTOR
3. type[dimension]

If we ever add sparse vectors, we can assume that DENSE is the default and
allow to use either DENSE, SPARSE or nothing.

Perhaps the dimension could be separated from the type, such as in
VECTOR[dimension] or VECTOR(dimension).

On Fri, 5 May 2023 at 19:05, David Capwell  wrote:

>>> ...where, just to be clear, VECTOR means a frozen fixed
>>> size array w/ no null values?
>> Assuming this is the case
>
> The current agreed requirements are:
>
> 1) non-null elements
> 2) fixed length
> 3) frozen
>
> You pointed out 3 isn’t actually required, but that would be a different
> conversation to remove =)… maybe defer this to JIRA as long as all parties
> agree in the ticket?

Re: [POLL] Vector type for ML

2023-05-05 Thread David Capwell
>> ...where, just to be clear, VECTOR means a frozen fixed 
>> size array w/ no null values?
> Assuming this is the case

The current agreed requirements are:

1) non-null elements
2) fixed length
3) frozen 

You pointed out 3 isn’t actually required, but that would be a different 
conversation to remove =)… maybe defer this to JIRA as long as all parties 
agree in the ticket?
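
The three requirements can be sketched as a small Python check. This is an illustration only, not the Cassandra implementation: `make_vector` is a hypothetical name, and an immutable tuple is used as a loose stand-in for "frozen" (the value is written and read as a whole, with no per-element updates).

```python
from typing import Optional, Sequence, Tuple


def make_vector(values: Sequence[Optional[float]], dimension: int) -> Tuple[float, ...]:
    """Validate the agreed requirements and return an immutable vector."""
    # 2) fixed length: the value must have exactly `dimension` elements
    if len(values) != dimension:
        raise ValueError(f"expected {dimension} elements, got {len(values)}")
    # 1) non-null elements
    if any(v is None for v in values):
        raise ValueError("vector elements must not be null")
    # 3) frozen: a tuple cannot be modified element by element
    return tuple(values)
```

Dropping requirement 3, as discussed above, would amount to returning a mutable list instead of a tuple in this sketch.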

With all votes in, this is what I see

Syntax                 Jonathan  David    Josh      Caleb      Patrick  Brandon   Mike     Benedict  Mick Semb  Derek
                       Ellis     Capwell  McKenzie  Rackliffe  McFadin  Williams  Adamson            Wever      Chen-Becker
VECTOR                 1         2        2         -          2        1         1        3         2          -
DENSE VECTOR           2         1        -         -          1        -         2        -         -          -
type[dimension]        3         3        3         1          -        3         -        2         -          -
DENSE_VECTOR           -         -        1         -          -        -         -        -         -          3
NON NULL [dimension]   -         1        -         -          -        -         -        1         -          2
VECTOR type[n]         -         -        -         -          -        2         -        -         1          -
ARRAY                  -         -        -         -          3        -         -        -         -          -
NON-NULL FROZEN        -         -        -         -          -        -         -        -         -          1

Rank   Weight
1      3
2      2
3      1
?      3

Syntax                 Score
VECTOR                 18
DENSE VECTOR           10
type[dimension]        9
NON NULL [dimension]   8
VECTOR type[n]         5
DENSE_VECTOR           4
NON-NULL FROZEN        3
ARRAY                  1

Syntax                 Round 1   Round 2
VECTOR                 3         4
DENSE VECTOR           2         2
NON NULL [dimension]   2         1
VECTOR type[n]         1
type[dimension]        1
DENSE_VECTOR           1
NON-NULL FROZEN        1
ARRAY                  0


Under 2 different voting systems VECTOR is in the lead, and by a good amount…
I have updated the patch locally to reflect this change as well.
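
The weighted scores can be reproduced from the ballots with a few lines of Python. The sketch below transcribes the ballot table from this message (None means no vote) and applies the Rank/Weight mapping. Reading "Round 1" as the count of first-place votes is an assumption; it does reproduce the posted numbers. Round 2 is not computed because the thread does not state its rule.

```python
from typing import Dict, List, Optional

# Column order of the ballot table in this message.
VOTERS = [
    "Jonathan Ellis", "David Capwell", "Josh McKenzie", "Caleb Rackliffe",
    "Patrick McFadin", "Brandon Williams", "Mike Adamson", "Benedict",
    "Mick Semb Wever", "Derek Chen-Becker",
]

# Rank given by each voter, in VOTERS order (None = no vote).
BALLOTS: Dict[str, List[Optional[int]]] = {
    "VECTOR":               [1, 2, 2, None, 2, 1, 1, 3, 2, None],
    "DENSE VECTOR":         [2, 1, None, None, 1, None, 2, None, None, None],
    "type[dimension]":      [3, 3, 3, 1, None, 3, None, 2, None, None],
    "DENSE_VECTOR":         [None, None, 1, None, None, None, None, None, None, 3],
    "NON NULL [dimension]": [None, 1, None, None, None, None, None, 1, None, 2],
    "VECTOR type[n]":       [None, None, None, None, None, 2, None, None, 1, None],
    "ARRAY":                [None, None, None, None, 3, None, None, None, None, None],
    "NON-NULL FROZEN":      [None] * 9 + [1],
}

# Rank -> weight. The "?" weight is unused because no "?" entry appears
# in the transcribed ballots.
WEIGHTS = {1: 3, 2: 2, 3: 1}


def weighted_scores(ballots: Dict[str, List[Optional[int]]]) -> Dict[str, int]:
    """Sum the weight of every rank a syntax received."""
    return {s: sum(WEIGHTS[r] for r in ranks if r is not None)
            for s, ranks in ballots.items()}


def first_choice_counts(ballots: Dict[str, List[Optional[int]]]) -> Dict[str, int]:
    """Count how many voters ranked each syntax first."""
    return {s: sum(1 for r in ranks if r == 1) for s, ranks in ballots.items()}


scores = weighted_scores(BALLOTS)
round_one = first_choice_counts(BALLOTS)
for syntax in sorted(scores, key=scores.get, reverse=True):
    print(f"{syntax:<22}{scores[syntax]:>4}{round_one[syntax]:>4}")
```

For example, VECTOR received ranks 1, 2, 2, 2, 1, 1, 3, 2, so its score is 3 + 2 + 2 + 2 + 3 + 3 + 1 + 2 = 18.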

> On May 5, 2023, at 10:41 AM, Mike Adamson  wrote:
> 
>> ...where, just to be clear, VECTOR means a frozen fixed 
>> size array w/ no null values?
> Assuming this is the case, my vote is:
> 
> 1. VECTOR
> 2. DENSE VECTOR
> 
> I don't really have a 3rd vote because I think that type[dimension] is too 
> ambiguous. 
> 
> 
> On Fri, 5 May 2023 at 18:32, Derek Chen-Becker wrote:
>> LOL, I'm holding you to that at the summit :) In all seriousness, I'm glad 
>> to see a robust debate around it. I guess for completeness, my order of 
>> preference is 
>> 
>> 1 - NONNULL FROZEN>
>> 2 - NONNULL TYPE (which part of this implies frozen? The NONNULL or the 
>> cardinality?)
>> 3 - DENSE_VECTOR
>> 
>> I guess my main concern with just "VECTOR" is that it's such an overloaded 
>> term. Maybe in ML it means something specific, but for anyone coming from 
>> C++, Rust, Java, etc, a Vector is both mutable and can carry null (or 
>> equivalent, e.g. None, in Rust). If the argument hadn't also been made that 
>> we should be working toward something that's not ML-specific maybe I would 
>> be less concerned.
>> 
>> Cheers,
>> 
>> Derek
>> 
>> On Fri, May 5, 2023 at 11:14 AM Patrick McFadin wrote:
>>> Derek, despite your preference, I would hang out with you at a party.
>>>
>>> On Fri, May 5, 2023 at 9:44 AM Derek Chen-Becker wrote:
>>>> Speaking as someone who likes Erlang, maybe that's why I also like NONNULL
>>>> FROZEN>. It's unambiguous what Cassandra is going to do with
>>>> that type. DENSE VECTOR means I need to go read docs (and then probably
>>>> double-check in the source to be sure) to be sure what exactly is going
>>>> on.
>>>>
>>>> Cheers,
>>>>
>>>> Derek
>>>>
>>>> On Fri, May 5, 2023 at 9:54 AM Patrick McFadin wrote:
>>>>> I hope we are willing to consider developers that use our system because
>>>>> if I had to teach people to use "NON-NULL FROZEN" I'm pretty
>>>>> sure the response would be:
>>>>>
>>>>> Did you tell me to go write a distributed map-reduce job in Erlang? I
>>>>> believe I did, Bob.
>>>>>
>>>>> On Fri, May 5, 2023 at 8:05 AM Josh McKenzie wrote:
>>>>>> Idiomatically, to my mind, there's a question of "what space are we
>>>>>> thinking about this datatype in"?
>>>>>>
>>>>>> - In the context of mathematics, nullability in a vector would be 0
>>>>>> - In the context of Cassandra, nullability tends to mean a tombstone (or
>>>>>> nothing)
>>>>>> - In the context of programming languages, it's all over the place
>>>>>>
>>>>>> Given many models are exploring quantizing to int8 and other data types,
>>>>>> there's definitely the "support other data types easily in the future"
>>>>>> piece to me we need to keep in mind.
>>>>>>
>>>>>> So with the above and the "meet the user where they are and don't make
>>>>>> them understand more of Cassandra than absolutely critical to use it", I
>>>>>> lean:
>>>>>>
>>>>>> 1. DENSE_VECTOR
>>>>>> 2. VECTOR
>>>>>> 3. type[dimension]
>>>>>>
>>>>>> This leaves the path open for us to expand on it in the future with
>>>>>> sparse support and allows us to introduce some semantics that indicate
>>>>>> idioms around nullability for the users coming from a different space.
>>>>>>
>>>>>> "NON-NULL FROZEN" is strictly correct, however it requires
>>>>>> understanding idioms of how Cassandra thinks about data (nulls mean
>>>>>> different things to us, we have differences between frozen and
>>>>>> non-frozen due to constraints in our storage engine and materialization
>>>>>> of data, etc) that get in the way of

Re: [POLL] Vector type for ML

2023-05-05 Thread Mike Adamson
>
> ...where, just to be clear, VECTOR means a frozen fixed
> size array w/ no null values?
>
Assuming this is the case, my vote is:

1. VECTOR
2. DENSE VECTOR

I don't really have a 3rd vote because I think that *type[dimension]* is
too ambiguous.
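
For readers less familiar with the terms, the definition quoted above (a frozen, fixed-size array with no null values) can be sketched in a few lines of Python. This is only an illustration of the three constraints, not Cassandra's implementation; `as_dense_vector` and its parameters are hypothetical names invented for this sketch.

```python
from typing import Optional, Sequence, Tuple


def as_dense_vector(values: Sequence[Optional[float]], dimension: int) -> Tuple[float, ...]:
    """Return values as an immutable, fixed-size, null-free tuple of floats.

    Hypothetical client-side helper; not part of Cassandra or any driver.
    """
    # Fixed size: the element count must match the declared dimension.
    if len(values) != dimension:
        raise ValueError(f"expected {dimension} elements, got {len(values)}")
    # No nulls: every position must hold a value.
    if any(v is None for v in values):
        raise ValueError("null elements are not allowed in a dense vector")
    # A tuple cannot be modified in place, which loosely mirrors "frozen":
    # the value is replaced as a whole rather than updated element by element.
    return tuple(float(v) for v in values)


embedding = as_dense_vector([0.1, 0.2, 0.3], dimension=3)
```

A list of the wrong length, or one containing `None`, raises `ValueError` in this sketch; that mirrors the "no null values" and "fixed size" parts of the definition.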


On Fri, 5 May 2023 at 18:32, Derek Chen-Becker 
wrote:

> LOL, I'm holding you to that at the summit :) In all seriousness, I'm glad
> to see a robust debate around it. I guess for completeness, my order of
> preference is
>
> 1 - NONNULL FROZEN>
> 2 - NONNULL TYPE (which part of this implies frozen? The NONNULL or the
> cardinality?)
> 3 - DENSE_VECTOR
>
> I guess my main concern with just "VECTOR" is that it's such an overloaded
> term. Maybe in ML it means something specific, but for anyone coming from
> C++, Rust, Java, etc, a Vector is both mutable and can carry null (or
> equivalent, e.g. None, in Rust). If the argument hadn't also been made that
> we should be working toward something that's not ML-specific maybe I would
> be less concerned.
>
> Cheers,
>
> Derek
>
>
> On Fri, May 5, 2023 at 11:14 AM Patrick McFadin 
> wrote:
>
>> Derek, despite your preference, I would hang out with you at a party.
>>
>> On Fri, May 5, 2023 at 9:44 AM Derek Chen-Becker 
>> wrote:
>>
>>> Speaking as someone who likes Erlang, maybe that's why I also like
>>> NONNULL FROZEN>. It's unambiguous what Cassandra is going to do
>>> with that type. DENSE VECTOR means I need to go read docs (and then
>>> probably double-check in the source to be sure) to be sure what exactly is
>>> going on.
>>>
>>> Cheers,
>>>
>>> Derek
>>>
>>> On Fri, May 5, 2023 at 9:54 AM Patrick McFadin 
>>> wrote:
>>>
>>>> I hope we are willing to consider developers that use our system
>>>> because if I had to teach people to use "NON-NULL FROZEN" I'm
>>>> pretty sure the response would be:
>>>>
>>>> Did you tell me to go write a distributed map-reduce job in Erlang? I
>>>> believe I did, Bob.
>>>>
>>>> On Fri, May 5, 2023 at 8:05 AM Josh McKenzie
>>>> wrote:
>>>>
>>>>> Idiomatically, to my mind, there's a question of "what space are we
>>>>> thinking about this datatype in"?
>>>>>
>>>>> - In the context of mathematics, nullability in a vector would be 0
>>>>> - In the context of Cassandra, nullability tends to mean a tombstone
>>>>> (or nothing)
>>>>> - In the context of programming languages, it's all over the place
>>>>>
>>>>> Given many models are exploring quantizing to int8 and other data
>>>>> types, there's definitely the "support other data types easily in the
>>>>> future" piece to me we need to keep in mind.
>>>>>
>>>>> So with the above and the "meet the user where they are and don't make
>>>>> them understand more of Cassandra than absolutely critical to use it", I
>>>>> lean:
>>>>>
>>>>> 1. DENSE_VECTOR
>>>>> 2. VECTOR
>>>>> 3. type[dimension]
>>>>>
>>>>> This leaves the path open for us to expand on it in the future with
>>>>> sparse support and allows us to introduce some semantics that indicate
>>>>> idioms around nullability for the users coming from a different space.
>>>>>
>>>>> "NON-NULL FROZEN" is strictly correct, however it requires
>>>>> understanding idioms of how Cassandra thinks about data (nulls mean
>>>>> different things to us, we have differences between frozen and non-frozen
>>>>> due to constraints in our storage engine and materialization of data, etc)
>>>>> that get in the way of users doing things in the pattern they're familiar
>>>>> with without learning more about the DB than they're probably looking to
>>>>> learn. Historically this has been a challenge for us in adoption; the
>>>>> classic "Why can't I just write and delete and write as much as I want? Why
>>>>> are deletes filling up my disk?" problem comes to mind.
>>>>>
>>>>> I'd also be happy with us supporting:
>>>>> * NON-NULL FROZEN
>>>>> * DENSE_VECTOR as syntactic sugar for the above
>>>>>
>>>>> If getting into the "built-in syntactic sugar mapping for communities
>>>>> and specific use-cases" is something we're willing to consider.
>>>>>
>>>>> On Fri, May 5, 2023, at 7:26 AM, Patrick McFadin wrote:
>>>>>
>>>>> I think we are still discussing implementation here when I'm talking
>>>>> about developer experience. I want developers to adopt this quickly, easily
>>>>> and be successful. Vector search is already a thing. People use it every
>>>>> day. A successful outcome, in my view, is developers picking up this
>>>>> feature without reading a manual. (Because they don't anyway and get in
>>>>> trouble) I did some more extensive research about what other DBs are using
>>>>> for syntax. The consensus is some variety of 'VECTOR', 'DENSE' and 'SPARSE'
>>>>>
>>>>> Pinecone[1] - dense_vector, sparse_vector
>>>>> Elastic[2]: dense_vector
>>>>> Milvus[3]: float_vector, binary_vector
>>>>> pgvector[4]: vector
>>>>> Weaviate[5]: Different approach. All typed arrays can be indexed
>>>>>
>>>>> Based on that I'm advocating a similar syntax:
>>>>>
>>>>> - DENSE VECTOR
>>>>> or

Re: [POLL] Vector type for ML

2023-05-05 Thread Derek Chen-Becker
LOL, I'm holding you to that at the summit :) In all seriousness, I'm glad
to see a robust debate around it. I guess for completeness, my order of
preference is

1 - NONNULL FROZEN>
2 - NONNULL TYPE (which part of this implies frozen? The NONNULL or the
cardinality?)
3 - DENSE_VECTOR

I guess my main concern with just "VECTOR" is that it's such an overloaded
term. Maybe in ML it means something specific, but for anyone coming from
C++, Rust, Java, etc, a Vector is both mutable and can carry null (or
equivalent, e.g. None, in Rust). If the argument hadn't also been made that
we should be working toward something that's not ML-specific maybe I would
be less concerned.

Cheers,

Derek


On Fri, May 5, 2023 at 11:14 AM Patrick McFadin  wrote:

> Derek, despite your preference, I would hang out with you at a party.
>
> On Fri, May 5, 2023 at 9:44 AM Derek Chen-Becker 
> wrote:
>
>> Speaking as someone who likes Erlang, maybe that's why I also like
>> NONNULL FROZEN>. It's unambiguous what Cassandra is going to do
>> with that type. DENSE VECTOR means I need to go read docs (and then
>> probably double-check in the source to be sure) to be sure what exactly is
>> going on.
>>
>> Cheers,
>>
>> Derek
>>
>> On Fri, May 5, 2023 at 9:54 AM Patrick McFadin 
>> wrote:
>>
>>> I hope we are willing to consider developers that use our system because
>>> if I had to teach people to use "NON-NULL FROZEN" I'm pretty sure
>>> the response would be:
>>>
>>> Did you tell me to go write a distributed map-reduce job in Erlang? I
>>> beleive I did, Bob.
>>>
>>> On Fri, May 5, 2023 at 8:05 AM Josh McKenzie 
>>> wrote:
>>>
>>>> Idiomatically, to my mind, there's a question of "what space are we
>>>> thinking about this datatype in"?
>>>>
>>>> - In the context of mathematics, nullability in a vector would be 0
>>>> - In the context of Cassandra, nullability tends to mean a tombstone
>>>> (or nothing)
>>>> - In the context of programming languages, it's all over the place
>>>>
>>>> Given many models are exploring quantizing to int8 and other data
>>>> types, there's definitely the "support other data types easily in the
>>>> future" piece to me we need to keep in mind.
>>>>
>>>> So with the above and the "meet the user where they are and don't make
>>>> them understand more of Cassandra than absolutely critical to use it", I
>>>> lean:
>>>>
>>>> 1. DENSE_VECTOR
>>>> 2. VECTOR
>>>> 3. type[dimension]
>>>>
>>>> This leaves the path open for us to expand on it in the future with
>>>> sparse support and allows us to introduce some semantics that indicate
>>>> idioms around nullability for the users coming from a different space.
>>>>
>>>> "NON-NULL FROZEN" is strictly correct, however it requires
>>>> understanding idioms of how Cassandra thinks about data (nulls mean
>>>> different things to us, we have differences between frozen and non-frozen
>>>> due to constraints in our storage engine and materialization of data, etc)
>>>> that get in the way of users doing things in the pattern they're familiar
>>>> with without learning more about the DB than they're probably looking to
>>>> learn. Historically this has been a challenge for us in adoption; the
>>>> classic "Why can't I just write and delete and write as much as I want? Why
>>>> are deletes filling up my disk?" problem comes to mind.
>>>>
>>>> I'd also be happy with us supporting:
>>>> * NON-NULL FROZEN
>>>> * DENSE_VECTOR as syntactic sugar for the above
>>>>
>>>> If getting into the "built-in syntactic sugar mapping for communities
>>>> and specific use-cases" is something we're willing to consider.
>>>>
>>>> On Fri, May 5, 2023, at 7:26 AM, Patrick McFadin wrote:
>>>>
>>>> I think we are still discussing implementation here when I'm talking
>>>> about developer experience. I want developers to adopt this quickly, easily
>>>> and be successful. Vector search is already a thing. People use it every
>>>> day. A successful outcome, in my view, is developers picking up this
>>>> feature without reading a manual. (Because they don't anyway and get in
>>>> trouble) I did some more extensive research about what other DBs are using
>>>> for syntax. The consensus is some variety of 'VECTOR', 'DENSE' and 'SPARSE'
>>>>
>>>> Pinecone[1] - dense_vector, sparse_vector
>>>> Elastic[2]: dense_vector
>>>> Milvus[3]: float_vector, binary_vector
>>>> pgvector[4]: vector
>>>> Weaviate[5]: Different approach. All typed arrays can be indexed
>>>>
>>>> Based on that I'm advocating a similar syntax:
>>>>
>>>> - DENSE VECTOR
>>>> or
>>>> - VECTOR
>>>>
>>>> [1] https://docs.pinecone.io/docs/hybrid-search
>>>> [2]
>>>> https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html
>>>> [3] https://milvus.io/docs/create_collection.md
>>>> [4] https://github.com/pgvector/pgvector
>>>> [5] https://weaviate.io/developers/weaviate/config-refs/datatypes
>>>>
>>>> On Fri, May 5, 2023 at 6:07 AM Mike Adamson
>>>> wrote:
>>>>
>>>> Then we can have the indexing apparatus only 

Re: [POLL] Vector type for ML

2023-05-05 Thread Patrick McFadin
Derek, despite your preference, I would hang out with you at a party.

On Fri, May 5, 2023 at 9:44 AM Derek Chen-Becker 
wrote:

> Speaking as someone who likes Erlang, maybe that's why I also like NONNULL
> FROZEN>. It's unambiguous what Cassandra is going to do with that
> type. DENSE VECTOR means I need to go read docs (and then probably
> double-check in the source to be sure) to be sure what exactly is going on.
>
> Cheers,
>
> Derek
>
> On Fri, May 5, 2023 at 9:54 AM Patrick McFadin  wrote:
>
>> I hope we are willing to consider developers that use our system because
>> if I had to teach people to use "NON-NULL FROZEN" I'm pretty sure
>> the response would be:
>>
>> Did you tell me to go write a distributed map-reduce job in Erlang? I
>> believe I did, Bob.
>>
>> On Fri, May 5, 2023 at 8:05 AM Josh McKenzie 
>> wrote:
>>
>>> Idiomatically, to my mind, there's a question of "what space are we
>>> thinking about this datatype in"?
>>>
>>> - In the context of mathematics, nullability in a vector would be 0
>>> - In the context of Cassandra, nullability tends to mean a tombstone (or
>>> nothing)
>>> - In the context of programming languages, it's all over the place
>>>
>>> Given many models are exploring quantizing to int8 and other data types,
>>> there's definitely the "support other data types easily in the future"
>>> piece to me we need to keep in mind.
>>>
>>> So with the above and the "meet the user where they are and don't make
>>> them understand more of Cassandra than absolutely critical to use it", I
>>> lean:
>>>
>>> 1. DENSE_VECTOR
>>> 2. VECTOR
>>> 3. type[dimension]
>>>
>>> This leaves the path open for us to expand on it in the future with
>>> sparse support and allows us to introduce some semantics that indicate
>>> idioms around nullability for the users coming from a different space.
>>>
>>> "NON-NULL FROZEN" is strictly correct, however it requires
>>> understanding idioms of how Cassandra thinks about data (nulls mean
>>> different things to us, we have differences between frozen and non-frozen
>>> due to constraints in our storage engine and materialization of data, etc)
>>> that get in the way of users doing things in the pattern they're familiar
>>> with without learning more about the DB than they're probably looking to
>>> learn. Historically this has been a challenge for us in adoption; the
>>> classic "Why can't I just write and delete and write as much as I want? Why
>>> are deletes filling up my disk?" problem comes to mind.
>>>
>>> I'd also be happy with us supporting:
>>> * NON-NULL FROZEN
>>> * DENSE_VECTOR as syntactic sugar for the above
>>>
>>> If getting into the "built-in syntactic sugar mapping for communities
>>> and specific use-cases" is something we're willing to consider.
>>>
>>> On Fri, May 5, 2023, at 7:26 AM, Patrick McFadin wrote:
>>>
>>> I think we are still discussing implementation here when I'm talking
>>> about developer experience. I want developers to adopt this quickly, easily
>>> and be successful. Vector search is already a thing. People use it every
>>> day. A successful outcome, in my view, is developers picking up this
>>> feature without reading a manual. (Because they don't anyway and get in
>>> trouble) I did some more extensive research about what other DBs are using
>>> for syntax. The consensus is some variety of 'VECTOR', 'DENSE' and 'SPARSE'
>>>
>>> Pinecone[1] - dense_vector, sparse_vector
>>> Elastic[2]: dense_vector
>>> Milvus[3]: float_vector, binary_vector
>>> pgvector[4]: vector
>>> Weaviate[5]: Different approach. All typed arrays can be indexed
>>>
>>> Based on that I'm advocating a similar syntax:
>>>
>>> - DENSE VECTOR
>>> or
>>> - VECTOR
>>>
>>> [1] https://docs.pinecone.io/docs/hybrid-search
>>> [2]
>>> https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html
>>> [3] https://milvus.io/docs/create_collection.md
>>> [4] https://github.com/pgvector/pgvector
>>> [5] https://weaviate.io/developers/weaviate/config-refs/datatypes
>>>
>>> On Fri, May 5, 2023 at 6:07 AM Mike Adamson 
>>> wrote:
>>>
>>> Then we can have the indexing apparatus only accept *frozen* for
>>> the HNSW case.
>>>
>>> I'm inclined to agree with Benedict that the index will need to be
>>> specifically selected by option rather than inferred based on type. As such
>>> there is no real reason for the *frozen* requirement on the type. The
>>> HNSW index can be built just as easily from a non-frozen array.
>>>
>>> I am in favour of enforcing non-null on the elements of an array by
>>> default. I would prefer that allowing nulls in the array would be a later
>>> addition if and when a use case arose for it.
>>>
>>> On Fri, 5 May 2023 at 03:02, Caleb Rackliffe 
>>> wrote:
>>>
>>> Even in the ML case, sparse can just mean zeros rather than nulls, and
>>> they should compress similarly anyway.
>>>
>>> If we really want null values, I'd rather leave that in collections
>>> space.
>>>
>>> On Thu, May 4, 2023 at 8:59 PM Caleb Rackliffe 
>>> wrote:

Re: [POLL] Vector type for ML

2023-05-05 Thread Patrick McFadin
My vote is:
1. DENSE VECTOR
2. VECTOR
3. ARRAY


On Fri, May 5, 2023 at 9:43 AM David Capwell  wrote:

> Went through and created a spreadsheet of current votes… For Patrick and
> Mike, I don’t see a clear vote, so I put a ? where I “think” your
> preference is… for Mick, I only put one vote as the list looked like a
> summary, but you mentioned the first was your preference
>
> Syntax               | Jonathan Ellis | David Capwell | Josh McKenzie | Caleb Rackliffe | Patrick McFadin | Brandon Williams | Mike Adamson | Benedict | Mick Semb Wever
> ---------------------+----------------+---------------+---------------+-----------------+-----------------+------------------+--------------+----------+----------------
> VECTOR               | 1              | 2             | 2             | -               | -               | 1                | ?            | 3        | -
> DENSE VECTOR         | 2              | 1             | -             | -               | ?               | -                | ?            | -        | -
> type[dimension]      | 3              | 3             | 3             | 1               | -               | 3                | -            | 2        | -
> DENSE_VECTOR         | -              | -             | 1             | -               | -               | -                | -            | -        | -
> NON NULL [dimention] | -              | 1             | -             | -               | -               | -                | -            | 1        | -
> VECTOR type[n]       | -              | -             | -             | -               | -               | 2                | -            | -        | 1
> ARRAY                | -              | -             | -             | -               | -               | -                | -            | -        | -
> NON-NULL FROZEN      | -              | -             | -             | -               | -               | -                | -            | -        | -
>
> 1 = top pick
> 2 = second pick
> 3 = third pick
>
> Let me know if I am missing anyone, or if I have bad data
>
> On May 5, 2023, at 9:23 AM, Jonathan Ellis  wrote:
>
> +10 for not inflicting unwieldy keywords on ML users.
>
> Re Josh's summary, mostly agreed, my only objection to adding the DENSE
> keyword is that I don't see a foreseeable future where we also support
> sparse vectors, so it would end up being unnecessary extra verbosity.  So
> my preference would be
>
> 1. VECTOR
> 2. DENSE VECTOR (space instead of underscore, SQL isn't
> afraid of spaces)
> 3. type[dimension]
>
> On Fri, May 5, 2023 at 10:54 AM Patrick McFadin 
> wrote:
>
>> I hope we are willing to consider developers that use our system because
>> if I had to teach people to use "NON-NULL FROZEN" I'm pretty sure
>> the response would be:
>>
>> Did you tell me to go write a distributed map-reduce job in Erlang? I
>> believe I did, Bob.
>>
>> On Fri, May 5, 2023 at 8:05 AM Josh McKenzie 
>> wrote:
>>
>>> Idiomatically, to my mind, there's a question of "what space are we
>>> thinking about this datatype in"?
>>>
>>> - In the context of mathematics, nullability in a vector would be 0
>>> - In the context of Cassandra, nullability tends to mean a tombstone (or
>>> nothing)
>>> - In the context of programming languages, it's all over the place
>>>
>>> Given many models are exploring quantizing to int8 and other data types,
>>> there's definitely the "support other data types easily in the future"
>>> piece to me we need to keep in mind.
>>>
>>> So with the above and the "meet the user where they are and don't make
>>> them understand more of Cassandra than absolutely critical to use it", I
>>> lean:
>>>
>>> 1. DENSE_VECTOR
>>> 2. VECTOR
>>> 3. type[dimension]
>>>
>>> This leaves the path open for us to expand on it in the future with
>>> sparse support and allows us to introduce some semantics that indicate
>>> idioms around nullability for the users coming from a different space.
>>>
>>> "NON-NULL FROZEN" is strictly correct, however it requires
>>> understanding idioms of how Cassandra thinks about data (nulls mean
>>> different things to us, we have differences between frozen and non-frozen
>>> due to constraints in our storage engine and materialization of data, etc)
>>> that get in the way of users doing things in the pattern they're familiar
>>> with without learning more about the DB than they're probably looking to
>>> learn. Historically this has been a challenge for us in adoption; the
>>> classic "Why can't I just write and delete and write as much as I want? Why
>>> are deletes filling up my disk?" problem comes to mind.
>>>
>>> I'd also be happy with us supporting:
>>> * NON-NULL FROZEN
>>> * DENSE_VECTOR as syntactic sugar for the above
>>>
>>> If getting into the "built-in syntactic sugar mapping for communities
>>> and specific use-cases" is something we're willing to consider.
>>>
>>> On Fri, May 5, 2023, at 7:26 AM, Patrick McFadin wrote:
>>>
>>> I think we are still discussing implementation here when I'm talking
>>> about developer experience. I want developers to adopt this quickly, easily
>>> and be successful. Vector search is already a thing. People use it every
>>> day. A successful outcome, in my view, is developers picking up this
>>> feature without reading a manual. (Because they don't anyway and get in
>>> trouble) I did some more extensive research about what other DBs are using
>>> for syntax. The consensus is some variety of 'VECTOR', 'DENSE' and 'SPARSE'
>>>
>>> Pinecone[1] - dense_vector, sparse_vector
>>> Elastic[2]: dense_vector
>>> Milvus[3]: float_vector, binary_vector
>>> pgvector[4]: vector
>>> Weaviate[5]: Different approach. All typed arrays can be indexed
>>>
>>> Based on that I'm advocating a similar syntax:
>>>
>>> - DENSE VECTOR
>>> or
>>> - VECTOR
>>>
>>> [1] https://docs.pinecone.io/docs/hybrid-search
>>> [2]
>>> https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html
>>> [3] https://milvus.io/docs/create_collection.md
>>> [4] 

Re: [POLL] Vector type for ML

2023-05-05 Thread David Capwell
Sorry, DENSE_VECTOR was pointing to the wrong row, updated score

Syntax               | Score
---------------------+------
VECTOR               | 16
DENSE VECTOR         | 11
type[dimension]      | 9
NON NULL [dimention] | 6
VECTOR type[n]       | 5
DENSE_VECTOR         | 3
NON-NULL FROZEN      | 3
ARRAY                | 0
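
The scores above can be reproduced from the vote table and the Rank/Weight mapping quoted further down in this message (1 = 3 points, 2 = 2, 3 = 1, and a "?" guess counted as 3). The following is a minimal sketch under that assumption; `WEIGHTS`, `VOTES` and `tally` are illustrative names, not something used in the thread, and the vote cells are transcribed from the quoted "Updated" table.

```python
# Weights from the thread's Rank/Weight table; "?" marks a guessed top pick.
WEIGHTS = {"1": 3, "2": 2, "3": 1, "?": 3}

# Votes per syntax, in the thread's voter order: Jonathan Ellis, David Capwell,
# Josh McKenzie, Caleb Rackliffe, Patrick McFadin, Brandon Williams,
# Mike Adamson, Benedict, Mick Semb Wever, Derek Chen-Becker.
# An empty string means the voter did not rank that option.
VOTES = {
    "VECTOR":               ["1", "2", "2", "", "", "1", "?", "3", "2", ""],
    "DENSE VECTOR":         ["2", "1", "", "", "?", "", "?", "", "", ""],
    "type[dimension]":      ["3", "3", "3", "1", "", "3", "", "2", "", ""],
    "DENSE_VECTOR":         ["", "", "1", "", "", "", "", "", "", ""],
    "NON NULL [dimention]": ["", "1", "", "", "", "", "", "1", "", ""],
    "VECTOR type[n]":       ["", "", "", "", "", "2", "", "", "1", ""],
    "ARRAY":                ["", "", "", "", "", "", "", "", "", ""],
    "NON-NULL FROZEN":      ["", "", "", "", "", "", "", "", "", "1"],
}


def tally(votes, weights):
    """Sum the weight of every ranked cell for each syntax."""
    return {
        syntax: sum(weights[cell] for cell in cells if cell)
        for syntax, cells in votes.items()
    }


scores = tally(VOTES, WEIGHTS)
# Highest score first; ties keep the order in which the rows are listed.
for syntax, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{syntax:<22}{score}")
```

With these inputs the sketch gives DENSE_VECTOR a score of 3 (a single top pick), which agrees with the corrected figure in this message rather than the 6 shown in the earlier tally.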

> On May 5, 2023, at 10:01 AM, David Capwell  wrote:
> 
> Updated
> 
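> Syntax               | Jonathan Ellis | David Capwell | Josh McKenzie | Caleb Rackliffe | Patrick McFadin | Brandon Williams | Mike Adamson | Benedict | Mick Semb Wever | Derek Chen-Becker
> ---------------------+----------------+---------------+---------------+-----------------+-----------------+------------------+--------------+----------+-----------------+------------------
> VECTOR               | 1              | 2             | 2             | -               | -               | 1                | ?            | 3        | 2               | -
> DENSE VECTOR         | 2              | 1             | -             | -               | ?               | -                | ?            | -        | -               | -
> type[dimension]      | 3              | 3             | 3             | 1               | -               | 3                | -            | 2        | -               | -
> DENSE_VECTOR         | -              | -             | 1             | -               | -               | -                | -            | -        | -               | -
> NON NULL [dimention] | -              | 1             | -             | -               | -               | -                | -            | 1        | -               | -
> VECTOR type[n]       | -              | -             | -             | -               | -               | 2                | -            | -        | 1               | -
> ARRAY                | -              | -             | -             | -               | -               | -                | -            | -        | -               | -
> NON-NULL FROZEN      | -              | -             | -             | -               | -               | -                | -            | -        | -               | 1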
> 
> Rank | Weight
> -----+-------
> 1    | 3
> 2    | 2
> 3    | 1
> ?    | 3
> 
> Syntax               | Score
> ---------------------+------
> VECTOR               | 16
> DENSE VECTOR         | 11
> type[dimension]      | 9
> DENSE_VECTOR         | 6
> NON NULL [dimention] | 6
> VECTOR type[n]       | 5
> NON-NULL FROZEN      | 3
> ARRAY                | 0
> 
> 
> ATM VECTOR is winning with DENSE VECTOR a 
> close second.. Patrick and Mike are the swing votes… Election Day is so 
> exciting! 
> 
>> On May 5, 2023, at 9:53 AM, Mick Semb Wever  wrote:
>> 
>> 
>> 
>> On Fri, 5 May 2023 at 18:43, David Capwell wrote:
>>> Went through and created a spreadsheet of current votes… For Patrick and
>>> Mike, I don’t see a clear vote, so I put a ? where I “think” your 
>>> preference is… for Mick, I only put one vote as the list looked like a 
>>> summary, but you mentioned the first was your preference
>>> 
>>> Syntax               | Jonathan Ellis | David Capwell | Josh McKenzie | Caleb Rackliffe | Patrick McFadin | Brandon Williams | Mike Adamson | Benedict | Mick Semb Wever
>>> ---------------------+----------------+---------------+---------------+-----------------+-----------------+------------------+--------------+----------+----------------
>>> VECTOR               | 1              | 2             | 2             | -               | -               | 1                | ?            | 3        | -
>>> DENSE VECTOR         | 2              | 1             | -             | -               | ?               | -                | ?            | -        | -
>>> type[dimension]      | 3              | 3             | 3             | 1               | -               | 3                | -            | 2        | -
>>> DENSE_VECTOR         | -              | -             | 1             | -               | -               | -                | -            | -        | -
>>> NON NULL [dimention] | -              | 1             | -             | -               | -               | -                | -            | 1        | -
>>> VECTOR type[n]       | -              | -             | -             | -               | -               | 2                | -            | -        | 1
>>> ARRAY                | -              | -             | -             | -               | -               | -                | -            | -        | -
>>> NON-NULL FROZEN      | -              | -             | -             | -               | -               | -                | -            | -        | -
>>> 
>>> 1 = top pick
>>> 2 = second pick
>>> 3 = third pick
>> 
>> 
>> Is what Josh writes always separate ?? 
>> 
>> My 2 is VECTOR
>> 
>> Thanks David for tallying.
>> 
> 



Re: [POLL] Vector type for ML

2023-05-05 Thread David Capwell
Updated

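Syntax               | Jonathan Ellis | David Capwell | Josh McKenzie | Caleb Rackliffe | Patrick McFadin | Brandon Williams | Mike Adamson | Benedict | Mick Semb Wever | Derek Chen-Becker
---------------------+----------------+---------------+---------------+-----------------+-----------------+------------------+--------------+----------+-----------------+------------------
VECTOR               | 1              | 2             | 2             | -               | -               | 1                | ?            | 3        | 2               | -
DENSE VECTOR         | 2              | 1             | -             | -               | ?               | -                | ?            | -        | -               | -
type[dimension]      | 3              | 3             | 3             | 1               | -               | 3                | -            | 2        | -               | -
DENSE_VECTOR         | -              | -             | 1             | -               | -               | -                | -            | -        | -               | -
NON NULL [dimention] | -              | 1             | -             | -               | -               | -                | -            | 1        | -               | -
VECTOR type[n]       | -              | -             | -             | -               | -               | 2                | -            | -        | 1               | -
ARRAY                | -              | -             | -             | -               | -               | -                | -            | -        | -               | -
NON-NULL FROZEN      | -              | -             | -             | -               | -               | -                | -            | -        | -               | 1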

Rank | Weight
-----+-------
1    | 3
2    | 2
3    | 1
?    | 3

Syntax               | Score
---------------------+------
VECTOR               | 16
DENSE VECTOR         | 11
type[dimension]      | 9
DENSE_VECTOR         | 6
NON NULL [dimention] | 6
VECTOR type[n]       | 5
NON-NULL FROZEN      | 3
ARRAY                | 0


ATM VECTOR is winning with DENSE VECTOR a 
close second.. Patrick and Mike are the swing votes… Election Day is so 
exciting! 

> On May 5, 2023, at 9:53 AM, Mick Semb Wever  wrote:
> 
> 
> 
> On Fri, 5 May 2023 at 18:43, David Capwell wrote:
>> Went through and created a spreadsheet of current votes… For Patrick and
>> Mike, I don’t see a clear vote, so I put a ? where I “think” your preference 
>> is… for Mick, I only put one vote as the list looked like a summary, but you 
>> mentioned the first was your preference
>> 
>> Syntax               | Jonathan Ellis | David Capwell | Josh McKenzie | Caleb Rackliffe | Patrick McFadin | Brandon Williams | Mike Adamson | Benedict | Mick Semb Wever
>> ---------------------+----------------+---------------+---------------+-----------------+-----------------+------------------+--------------+----------+----------------
>> VECTOR               | 1              | 2             | 2             | -               | -               | 1                | ?            | 3        | -
>> DENSE VECTOR         | 2              | 1             | -             | -               | ?               | -                | ?            | -        | -
>> type[dimension]      | 3              | 3             | 3             | 1               | -               | 3                | -            | 2        | -
>> DENSE_VECTOR         | -              | -             | 1             | -               | -               | -                | -            | -        | -
>> NON NULL [dimention] | -              | 1             | -             | -               | -               | -                | -            | 1        | -
>> VECTOR type[n]       | -              | -             | -             | -               | -               | 2                | -            | -        | 1
>> ARRAY                | -              | -             | -             | -               | -               | -                | -            | -        | -
>> NON-NULL FROZEN      | -              | -             | -             | -               | -               | -                | -            | -        | -
>> 
>> 1 = top pick
>> 2 = second pick
>> 3 = third pick
> 
> 
> Is what Josh writes always separate ?? 
> 
> My 2 is VECTOR
> 
> Thanks David for tallying.
> 



Re: [POLL] Vector type for ML

2023-05-05 Thread Mick Semb Wever
On Fri, 5 May 2023 at 18:43, David Capwell  wrote:

> Went through and created a spreadsheet of current votes… For Patrick and
> Mike, I don’t see a clear vote, so I put a ? where I “think” your
> preference is… for Mick, I only put one vote as the list looked like a
> summary, but you mentioned the first was your preference
>
> Syntax               | Jonathan Ellis | David Capwell | Josh McKenzie | Caleb Rackliffe | Patrick McFadin | Brandon Williams | Mike Adamson | Benedict | Mick Semb Wever
> ---------------------+----------------+---------------+---------------+-----------------+-----------------+------------------+--------------+----------+----------------
> VECTOR               | 1              | 2             | 2             | -               | -               | 1                | ?            | 3        | -
> DENSE VECTOR         | 2              | 1             | -             | -               | ?               | -                | ?            | -        | -
> type[dimension]      | 3              | 3             | 3             | 1               | -               | 3                | -            | 2        | -
> DENSE_VECTOR         | -              | -             | 1             | -               | -               | -                | -            | -        | -
> NON NULL [dimention] | -              | 1             | -             | -               | -               | -                | -            | 1        | -
> VECTOR type[n]       | -              | -             | -             | -               | -               | 2                | -            | -        | 1
> ARRAY                | -              | -             | -             | -               | -               | -                | -            | -        | -
> NON-NULL FROZEN      | -              | -             | -             | -               | -               | -                | -            | -        | -
>
> 1 = top pick
> 2 = second pick
> 3 = third pick
>


Is what Josh writes always separate ??

My 2 is VECTOR

Thanks David for tallying.


Re: [POLL] Vector type for ML

2023-05-05 Thread Derek Chen-Becker
Speaking as someone who likes Erlang, maybe that's why I also like NONNULL
FROZEN>. It's unambiguous what Cassandra is going to do with that
type. DENSE VECTOR means I need to go read docs (and then probably
double-check in the source to be sure) to be sure what exactly is going on.

Cheers,

Derek

On Fri, May 5, 2023 at 9:54 AM Patrick McFadin  wrote:

> I hope we are willing to consider developers that use our system because
> if I had to teach people to use "NON-NULL FROZEN" I'm pretty sure
> the response would be:
>
> Did you tell me to go write a distributed map-reduce job in Erlang? I
> believe I did, Bob.
>
> On Fri, May 5, 2023 at 8:05 AM Josh McKenzie  wrote:
>
>> Idiomatically, to my mind, there's a question of "what space are we
>> thinking about this datatype in"?
>>
>> - In the context of mathematics, nullability in a vector would be 0
>> - In the context of Cassandra, nullability tends to mean a tombstone (or
>> nothing)
>> - In the context of programming languages, it's all over the place
>>
>> Given many models are exploring quantizing to int8 and other data types,
>> there's definitely the "support other data types easily in the future"
>> piece to me we need to keep in mind.
>>
>> So with the above and the "meet the user where they are and don't make
>> them understand more of Cassandra than absolutely critical to use it", I
>> lean:
>>
>> 1. DENSE_VECTOR
>> 2. VECTOR
>> 3. type[dimension]
>>
>> This leaves the path open for us to expand on it in the future with
>> sparse support and allows us to introduce some semantics that indicate
>> idioms around nullability for the users coming from a different space.
>>
>> "NON-NULL FROZEN" is strictly correct, however it requires
>> understanding idioms of how Cassandra thinks about data (nulls mean
>> different things to us, we have differences between frozen and non-frozen
>> due to constraints in our storage engine and materialization of data, etc)
>> that get in the way of users doing things in the pattern they're familiar
>> with without learning more about the DB than they're probably looking to
>> learn. Historically this has been a challenge for us in adoption; the
>> classic "Why can't I just write and delete and write as much as I want? Why
>> are deletes filling up my disk?" problem comes to mind.
>>
>> I'd also be happy with us supporting:
>> * NON-NULL FROZEN
>> * DENSE_VECTOR as syntactic sugar for the above
>>
>> If getting into the "built-in syntactic sugar mapping for communities and
>> specific use-cases" is something we're willing to consider.
>>
>> On Fri, May 5, 2023, at 7:26 AM, Patrick McFadin wrote:
>>
>> I think we are still discussing implementation here when I'm talking
>> about developer experience. I want developers to adopt this quickly, easily
>> and be successful. Vector search is already a thing. People use it every
>> day. A successful outcome, in my view, is developers picking up this
>> feature without reading a manual. (Because they don't anyway and get in
>> trouble) I did some more extensive research about what other DBs are using
>> for syntax. The consensus is some variety of 'VECTOR', 'DENSE' and 'SPARSE'
>>
>> Pinecone[1] - dense_vector, sparse_vector
>> Elastic[2]: dense_vector
>> Milvus[3]: float_vector, binary_vector
>> pgvector[4]: vector
>> Weaviate[5]: Different approach. All typed arrays can be indexed
>>
>> Based on that I'm advocating a similar syntax:
>>
>> - DENSE VECTOR
>> or
>> - VECTOR
>>
>> [1] https://docs.pinecone.io/docs/hybrid-search
>> [2]
>> https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html
>> [3] https://milvus.io/docs/create_collection.md
>> [4] https://github.com/pgvector/pgvector
>> [5] https://weaviate.io/developers/weaviate/config-refs/datatypes
>>
>> On Fri, May 5, 2023 at 6:07 AM Mike Adamson 
>> wrote:
>>
>> Then we can have the indexing apparatus only accept *frozen* for
>> the HNSW case.
>>
>> I'm inclined to agree with Benedict that the index will need to be
>> specifically selected by option rather than inferred based on type. As such
>> there is no real reason for the *frozen* requirement on the type. The
>> HNSW index can be built just as easily from a non-frozen array.
>>
>> I am in favour of enforcing non-null on the elements of an array by
>> default. I would prefer that allowing nulls in the array would be a later
>> addition if and when a use case arose for it.
>>
>> On Fri, 5 May 2023 at 03:02, Caleb Rackliffe 
>> wrote:
>>
>> Even in the ML case, sparse can just mean zeros rather than nulls, and
>> they should compress similarly anyway.
>>
>> If we really want null values, I'd rather leave that in collections space.
>>
>> On Thu, May 4, 2023 at 8:59 PM Caleb Rackliffe 
>> wrote:
>>
>> I actually still prefer *type[dimension]*, because I think I intuitively
>> read this as a primitive (meaning no null elements) array. Then we can have
>> the indexing apparatus only accept *frozen* for the HNSW case.
>>
>> If that isn't intuitive to 

Re: [POLL] Vector type for ML

2023-05-05 Thread David Capwell
Went through and created a spreadsheet of current votes… For Patrick and Mike,
I don’t see a clear vote, so I put a ? where I “think” your preference is… for 
Mick, I only put one vote as the list looked like a summary, but you mentioned 
the first was your preference

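Syntax               | Jonathan Ellis | David Capwell | Josh McKenzie | Caleb Rackliffe | Patrick McFadin | Brandon Williams | Mike Adamson | Benedict | Mick Semb Wever
---------------------+----------------+---------------+---------------+-----------------+-----------------+------------------+--------------+----------+----------------
VECTOR               | 1              | 2             | 2             | -               | -               | 1                | ?            | 3        | -
DENSE VECTOR         | 2              | 1             | -             | -               | ?               | -                | ?            | -        | -
type[dimension]      | 3              | 3             | 3             | 1               | -               | 3                | -            | 2        | -
DENSE_VECTOR         | -              | -             | 1             | -               | -               | -                | -            | -        | -
NON NULL [dimention] | -              | 1             | -             | -               | -               | -                | -            | 1        | -
VECTOR type[n]       | -              | -             | -             | -               | -               | 2                | -            | -        | 1
ARRAY                | -              | -             | -             | -               | -               | -                | -            | -        | -
NON-NULL FROZEN      | -              | -             | -             | -               | -               | -                | -            | -        | -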

1 = top pick
2 = second pick
3 = third pick

Let me know if I am missing anyone, or if I have bad data
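
For anyone who wants to turn the picks above into a single score per syntax, here is a minimal sketch. The weighting (3 points for a top pick, 2 for a second pick, 1 for a third pick) is an assumption made for illustration and is not stated anywhere in this thread; the "?" entries are left out because they are not confirmed votes. The names `WEIGHTS`, `VOTES` and `tally` are hypothetical.

```python
# Assumed weighting: the thread only defines the ranks, not how to score them.
WEIGHTS = {1: 3, 2: 2, 3: 1}

# Ranks as read from the vote sheet above; "?" entries are omitted.
VOTES = {
    "VECTOR": {"Jonathan Ellis": 1, "David Capwell": 2, "Josh McKenzie": 2,
               "Brandon Williams": 1, "Benedict": 3},
    "DENSE VECTOR": {"Jonathan Ellis": 2, "David Capwell": 1},
    "type[dimension]": {"Jonathan Ellis": 3, "David Capwell": 3,
                        "Josh McKenzie": 3, "Caleb Rackliffe": 1,
                        "Brandon Williams": 3, "Benedict": 2},
    "DENSE_VECTOR": {"Josh McKenzie": 1},
    "NON NULL [dimension]": {"David Capwell": 1, "Benedict": 1},
    "VECTOR type[n]": {"Brandon Williams": 2, "Mick Semb Wever": 1},
}


def tally(votes, weights=WEIGHTS):
    """Return (syntax, score) pairs, highest score first, ties by name."""
    scores = {
        syntax: sum(weights[rank] for rank in ranks.values())
        for syntax, ranks in votes.items()
    }
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    for syntax, score in tally(VOTES):
        print(f"{syntax}: {score}")
```

A different weighting, or counting the "?" entries, would change the numbers, so treat the result as a rough reading of the sheet rather than an official count.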

> On May 5, 2023, at 9:23 AM, Jonathan Ellis  wrote:
> 
> +10 for not inflicting unwieldy keywords on ML users.
> 
> Re Josh's summary, mostly agreed, my only objection to adding the DENSE 
> keyword is that I don't see a foreseeable future where we also support sparse 
> vectors, so it would end up being unnecessary extra verbosity.  So my 
> preference would be
> 
> 1. VECTOR
> 2. DENSE VECTOR (space instead of underscore, SQL isn't 
> afraid of spaces)
> 3. type[dimension]
> 
> On Fri, May 5, 2023 at 10:54 AM Patrick McFadin wrote:
>> I hope we are willing to consider developers that use our system because if 
>> I had to teach people to use "NON-NULL FROZEN" I'm pretty sure the 
>> response would be:
>> 
>> Did you tell me to go write a distributed map-reduce job in Erlang? I 
>> believe I did, Bob.
>> 
>> On Fri, May 5, 2023 at 8:05 AM Josh McKenzie wrote:
>>> Idiomatically, to my mind, there's a question of "what space are we 
>>> thinking about this datatype in"?
>>> 
>>> - In the context of mathematics, nullability in a vector would be 0
>>> - In the context of Cassandra, nullability tends to mean a tombstone (or 
>>> nothing)
>>> - In the context of programming languages, it's all over the place
>>> 
>>> Given many models are exploring quantizing to int8 and other data types, 
>>> there's definitely the "support other data types easily in the future" 
>>> piece to me we need to keep in mind.
>>> 
>>> So with the above and the "meet the user where they are and don't make them 
>>> understand more of Cassandra than absolutely critical to use it", I lean:
>>> 
>>> 1. DENSE_VECTOR
>>> 2. VECTOR
>>> 3. type[dimension]
>>> 
>>> This leaves the path open for us to expand on it in the future with sparse 
>>> support and allows us to introduce some semantics that indicate idioms 
>>> around nullability for the users coming from a different space.
>>> 
>>> "NON-NULL FROZEN" is strictly correct, however it requires 
>>> understanding idioms of how Cassandra thinks about data (nulls mean 
>>> different things to us, we have differences between frozen and non-frozen 
>>> due to constraints in our storage engine and materialization of data, etc) 
>>> that get in the way of users doing things in the pattern they're familiar 
>>> with without learning more about the DB than they're probably looking to 
>>> learn. Historically this has been a challenge for us in adoption; the 
>>> classic "Why can't I just write and delete and write as much as I want? Why 
>>> are deletes filling up my disk?" problem comes to mind.
>>> 
>>> I'd also be happy with us supporting:
>>> * NON-NULL FROZEN
>>> * DENSE_VECTOR as syntactic sugar for the above
>>> 
>>> If getting into the "built-in syntactic sugar mapping for communities and 
>>> specific use-cases" is something we're willing to consider.
>>> 
>>> On Fri, May 5, 2023, at 7:26 AM, Patrick McFadin wrote:
>>>> I think we are still discussing implementation here when I'm talking about 
>>>> developer experience. I want developers to adopt this quickly, easily and 
>>>> be successful. Vector search is already a thing. People use it every day. 
>>>> A successful outcome, in my view, is developers picking up this feature 
>>>> without reading a manual. (Because they don't anyway and get in trouble) I 
>>>> did some more extensive research about what other DBs are using for 
>>>> syntax. The consensus is some variety of 'VECTOR', 'DENSE' and 'SPARSE'
>>>>
>>>> Pinecone[1] - dense_vector, sparse_vector
>>>> Elastic[2]: dense_vector
>>>> Milvus[3]: float_vector, binary_vector
>>>> pgvector[4]: vector
>>>> Weaviate[5]: Different approach. All typed arrays can be indexed
>>>>
>>>> Based on that I'm advocating a similar syntax:
>>>>
>>>> - DENSE VECTOR
>>>> or
>>>> - VECTOR
>>>>
>>>> [1] https://docs.pinecone.io/docs/hybrid-search
>>>> [2]
>>>> https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html
>>>> [3] https://milvus.io/docs/create_collection.md
>>>> [4] https://github.com/pgvector/pgvector
>>>> [5] https://weaviate.io/developers/weaviate/config-refs/datatypes
>>>>
>>>> On Fri, May 5, 2023 at 6:07 AM Mike Adamson wrote:
>>>>> Then we can have the

Re: [POLL] Vector type for ML

2023-05-05 Thread Caleb Rackliffe
...where, just to be clear, VECTOR means a frozen fixed
size array w/ no null values?

On Fri, May 5, 2023 at 11:23 AM Jonathan Ellis  wrote:

> +10 for not inflicting unwieldy keywords on ML users.
>
> Re Josh's summary, mostly agreed, my only objection to adding the DENSE
> keyword is that I don't see a foreseeable future where we also support
> sparse vectors, so it would end up being unnecessary extra verbosity.  So
> my preference would be
>
> 1. VECTOR
> 2. DENSE VECTOR (space instead of underscore, SQL isn't
> afraid of spaces)
> 3. type[dimension]
>
> On Fri, May 5, 2023 at 10:54 AM Patrick McFadin 
> wrote:
>
>> I hope we are willing to consider developers that use our system because
>> if I had to teach people to use "NON-NULL FROZEN" I'm pretty sure
>> the response would be:
>>
>> Did you tell me to go write a distributed map-reduce job in Erlang? I
>> believe I did, Bob.
>>
>> On Fri, May 5, 2023 at 8:05 AM Josh McKenzie 
>> wrote:
>>
>>> Idiomatically, to my mind, there's a question of "what space are we
>>> thinking about this datatype in"?
>>>
>>> - In the context of mathematics, nullability in a vector would be 0
>>> - In the context of Cassandra, nullability tends to mean a tombstone (or
>>> nothing)
>>> - In the context of programming languages, it's all over the place
>>>
>>> Given many models are exploring quantizing to int8 and other data types,
>>> there's definitely the "support other data types easily in the future"
>>> piece to me we need to keep in mind.
>>>
>>> So with the above and the "meet the user where they are and don't make
>>> them understand more of Cassandra than absolutely critical to use it", I
>>> lean:
>>>
>>> 1. DENSE_VECTOR
>>> 2. VECTOR
>>> 3. type[dimension]
>>>
>>> This leaves the path open for us to expand on it in the future with
>>> sparse support and allows us to introduce some semantics that indicate
>>> idioms around nullability for the users coming from a different space.
>>>
>>> "NON-NULL FROZEN" is strictly correct, however it requires
>>> understanding idioms of how Cassandra thinks about data (nulls mean
>>> different things to us, we have differences between frozen and non-frozen
>>> due to constraints in our storage engine and materialization of data, etc)
>>> that get in the way of users doing things in the pattern they're familiar
>>> with without learning more about the DB than they're probably looking to
>>> learn. Historically this has been a challenge for us in adoption; the
>>> classic "Why can't I just write and delete and write as much as I want? Why
>>> are deletes filling up my disk?" problem comes to mind.
>>>
>>> I'd also be happy with us supporting:
>>> * NON-NULL FROZEN
>>> * DENSE_VECTOR as syntactic sugar for the above
>>>
>>> If getting into the "built-in syntactic sugar mapping for communities
>>> and specific use-cases" is something we're willing to consider.
>>>
>>> On Fri, May 5, 2023, at 7:26 AM, Patrick McFadin wrote:
>>>
>>> I think we are still discussing implementation here when I'm talking
>>> about developer experience. I want developers to adopt this quickly, easily
>>> and be successful. Vector search is already a thing. People use it every
>>> day. A successful outcome, in my view, is developers picking up this
>>> feature without reading a manual. (Because they don't anyway and get in
>>> trouble) I did some more extensive research about what other DBs are using
>>> for syntax. The consensus is some variety of 'VECTOR', 'DENSE' and 'SPARSE'
>>>
>>> Pinecone[1] - dense_vector, sparse_vector
>>> Elastic[2]: dense_vector
>>> Milvus[3]: float_vector, binary_vector
>>> pgvector[4]: vector
>>> Weaviate[5]: Different approach. All typed arrays can be indexed
>>>
>>> Based on that I'm advocating a similar syntax:
>>>
>>> - DENSE VECTOR
>>> or
>>> - VECTOR
>>>
>>> [1] https://docs.pinecone.io/docs/hybrid-search
>>> [2]
>>> https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html
>>> [3] https://milvus.io/docs/create_collection.md
>>> [4] https://github.com/pgvector/pgvector
>>> [5] https://weaviate.io/developers/weaviate/config-refs/datatypes
>>>
>>> On Fri, May 5, 2023 at 6:07 AM Mike Adamson 
>>> wrote:
>>>
>>> Then we can have the indexing apparatus only accept *frozen* for
>>> the HNSW case.
>>>
>>> I'm inclined to agree with Benedict that the index will need to be
>>> specifically selected by option rather than inferred based on type. As such
>>> there is no real reason for the *frozen* requirement on the type. The
>>> hnsw index can be built just as easily from a non-frozen array.
>>>
>>> I am in favour of enforcing non-null on the elements of an array by
>>> default. I would prefer that allowing nulls in the array would be a later
>>> addition if and when a use case arose for it.
>>>
>>> On Fri, 5 May 2023 at 03:02, Caleb Rackliffe 
>>> wrote:
>>>
>>> Even in the ML case, sparse can just mean zeros rather than nulls, and
>>> they should compress similarly anyway.
>>>
>>> If we really want null 

Re: [POLL] Vector type for ML

2023-05-05 Thread Jonathan Ellis
+10 for not inflicting unwieldy keywords on ML users.

Re Josh's summary: mostly agreed. My only objection to adding the DENSE
keyword is that I don't see a foreseeable future where we also support
sparse vectors, so it would end up being unnecessary extra verbosity.  So
my preference would be:

1. VECTOR
2. DENSE VECTOR (space instead of underscore, SQL isn't
afraid of spaces)
3. type[dimension]

On Fri, May 5, 2023 at 10:54 AM Patrick McFadin  wrote:

> I hope we are willing to consider developers that use our system because
> if I had to teach people to use "NON-NULL FROZEN" I'm pretty sure
> the response would be:
>
> Did you tell me to go write a distributed map-reduce job in Erlang? I
> believe I did, Bob.
>
> On Fri, May 5, 2023 at 8:05 AM Josh McKenzie  wrote:
>
>> Idiomatically, to my mind, there's a question of "what space are we
>> thinking about this datatype in"?
>>
>> - In the context of mathematics, nullability in a vector would be 0
>> - In the context of Cassandra, nullability tends to mean a tombstone (or
>> nothing)
>> - In the context of programming languages, it's all over the place
>>
>> Given many models are exploring quantizing to int8 and other data types,
>> there's definitely the "support other data types easily in the future"
>> piece to me we need to keep in mind.
>>
>> So with the above and the "meet the user where they are and don't make
>> them understand more of Cassandra than absolutely critical to use it", I
>> lean:
>>
>> 1. DENSE_VECTOR
>> 2. VECTOR
>> 3. type[dimension]
>>
>> This leaves the path open for us to expand on it in the future with
>> sparse support and allows us to introduce some semantics that indicate
>> idioms around nullability for the users coming from a different space.
>>
>> "NON-NULL FROZEN" is strictly correct, however it requires
>> understanding idioms of how Cassandra thinks about data (nulls mean
>> different things to us, we have differences between frozen and non-frozen
>> due to constraints in our storage engine and materialization of data, etc)
>> that get in the way of users doing things in the pattern they're familiar
>> with without learning more about the DB than they're probably looking to
>> learn. Historically this has been a challenge for us in adoption; the
>> classic "Why can't I just write and delete and write as much as I want? Why
>> are deletes filling up my disk?" problem comes to mind.
>>
>> I'd also be happy with us supporting:
>> * NON-NULL FROZEN
>> * DENSE_VECTOR as syntactic sugar for the above
>>
>> If getting into the "built-in syntactic sugar mapping for communities and
>> specific use-cases" is something we're willing to consider.
>>
>> On Fri, May 5, 2023, at 7:26 AM, Patrick McFadin wrote:
>>
>> I think we are still discussing implementation here when I'm talking
>> about developer experience. I want developers to adopt this quickly, easily
>> and be successful. Vector search is already a thing. People use it every
>> day. A successful outcome, in my view, is developers picking up this
>> feature without reading a manual. (Because they don't anyway and get in
>> trouble) I did some more extensive research about what other DBs are using
>> for syntax. The consensus is some variety of 'VECTOR', 'DENSE' and 'SPARSE'
>>
>> Pinecone[1] - dense_vector, sparse_vector
>> Elastic[2]: dense_vector
>> Milvus[3]: float_vector, binary_vector
>> pgvector[4]: vector
>> Weaviate[5]: Different approach. All typed arrays can be indexed
>>
>> Based on that I'm advocating a similar syntax:
>>
>> - DENSE VECTOR
>> or
>> - VECTOR
>>
>> [1] https://docs.pinecone.io/docs/hybrid-search
>> [2]
>> https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html
>> [3] https://milvus.io/docs/create_collection.md
>> [4] https://github.com/pgvector/pgvector
>> [5] https://weaviate.io/developers/weaviate/config-refs/datatypes
>>
>> On Fri, May 5, 2023 at 6:07 AM Mike Adamson 
>> wrote:
>>
>> Then we can have the indexing apparatus only accept *frozen* for
>> the HNSW case.
>>
>> I'm inclined to agree with Benedict that the index will need to be
>> specifically selected by option rather than inferred based on type. As such
>> there is no real reason for the *frozen* requirement on the type. The
>> hnsw index can be built just as easily from a non-frozen array.
>>
>> I am in favour of enforcing non-null on the elements of an array by
>> default. I would prefer that allowing nulls in the array would be a later
>> addition if and when a use case arose for it.
>>
>> On Fri, 5 May 2023 at 03:02, Caleb Rackliffe 
>> wrote:
>>
>> Even in the ML case, sparse can just mean zeros rather than nulls, and
>> they should compress similarly anyway.
>>
>> If we really want null values, I'd rather leave that in collections space.
>>
>> On Thu, May 4, 2023 at 8:59 PM Caleb Rackliffe 
>> wrote:
>>
>> I actually still prefer *type[dimension]*, because I think I intuitively
>> read this as a primitive (meaning no null elements) array. Then we can have

Re: [POLL] Vector type for ML

2023-05-05 Thread David Capwell
> The hnsw index can be built just as easily from a non-frozen array.

I have 0 issues removing that limitation =)

> I am in favour of enforcing non-null on the elements of an array by default.

This is why I feel DENSE or NON NULL are the best prefixes, as both imply that 
elements may not be null.  A sparse vector represents missing data with the 
zero of its domain, so for a nullable type that is null, but for an int that is 
0…. It’s still missing data… whereas Dense does not allow missing data (aka NON 
NULL)

> Given many models are exploring quantizing to int8 and other data types, 
> there's definitely the "support other data types easily in the future" piece 
> to me we need to keep in mind.

I took Jonathan’s patch and enhanced it to work with every type in Cassandra 
(current and future)… I added random fuzz testing to make sure that type-level 
properties hold for all types (and found bugs in many types… yay?)… 

So in my patch you can do the following

DenseVector(42) - represents a float-based vector; this works with a 
fixed-length type, so it uses an encoding that omits lengths (we know each 
element is 4 bytes… why write that?)
DenseVector(42) - represents a short-based vector; this works with a 
variable-length type (why is ShortType variable length?!), so it encodes the 
length as part of the format; think frozen list serialization format
DenseVector, DenseVector>> - vector of map… if 
you add Short support, then Map is 0 effort, as they require the same things….  
It actually takes more work to not allow map

I block null data, but for numeric types I do not block 0, as 0 is also a valid 
non-null element… (yay math confusion…)… in my definition [0, 0, 0] is a valid 
3 dim vector of int...
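
To make the two encodings described above concrete, here is a minimal sketch in Python. It is not the format used in the patch: the big-endian byte order, the 4-byte length prefix for variable-length elements, and all function names (`encode_fixed`, `encode_variable`, and so on) are assumptions chosen for illustration. It shows fixed-length elements written back to back with no lengths, variable-length elements written with a length prefix, and null elements rejected while zeros are accepted.

```python
import struct
from typing import List, Sequence

FLOAT_SIZE = 4  # bytes per 32-bit float element


def _check(values: Sequence, dimension: int) -> None:
    """Reject wrong-sized vectors and null elements; zero is a valid value."""
    if len(values) != dimension:
        raise ValueError(f"expected {dimension} elements, got {len(values)}")
    if any(v is None for v in values):
        raise ValueError("null elements are not allowed")


def encode_fixed(values: Sequence[float], dimension: int) -> bytes:
    """Fixed-length elements: write values back to back, no length prefix."""
    _check(values, dimension)
    return struct.pack(f">{dimension}f", *values)


def decode_fixed(buf: bytes, dimension: int) -> List[float]:
    """The element size is known, so the dimension alone locates each value."""
    if len(buf) != dimension * FLOAT_SIZE:
        raise ValueError("buffer size does not match dimension")
    return list(struct.unpack(f">{dimension}f", buf))


def encode_variable(elements: Sequence[bytes], dimension: int) -> bytes:
    """Variable-length elements: prefix each element with its byte length."""
    _check(elements, dimension)
    out = bytearray()
    for element in elements:
        out += struct.pack(">i", len(element))  # assumed 4-byte length prefix
        out += element
    return bytes(out)


def decode_variable(buf: bytes, dimension: int) -> List[bytes]:
    """Read each length prefix, then slice out that many bytes."""
    elements, offset = [], 0
    for _ in range(dimension):
        (length,) = struct.unpack_from(">i", buf, offset)
        offset += 4
        elements.append(buf[offset:offset + length])
        offset += length
    if offset != len(buf):
        raise ValueError("trailing bytes after last element")
    return elements
```

In this sketch a 3-dimensional float vector always takes 12 bytes, while the variable-length form pays 4 extra bytes per element for the prefix, which is the cost the fixed-length path avoids.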

> On May 5, 2023, at 8:53 AM, Patrick McFadin  wrote:
> 
> I hope we are willing to consider developers that use our system because if I 
> had to teach people to use "NON-NULL FROZEN" I'm pretty sure the 
> response would be:
> 
> Did you tell me to go write a distributed map-reduce job in Erlang? I believe 
> I did, Bob.  
> 
> On Fri, May 5, 2023 at 8:05 AM Josh McKenzie wrote:
>> Idiomatically, to my mind, there's a question of "what space are we thinking 
>> about this datatype in"?
>> 
>> - In the context of mathematics, nullability in a vector would be 0
>> - In the context of Cassandra, nullability tends to mean a tombstone (or 
>> nothing)
>> - In the context of programming languages, it's all over the place
>> 
>> Given many models are exploring quantizing to int8 and other data types, 
>> there's definitely the "support other data types easily in the future" piece 
>> to me we need to keep in mind.
>> 
>> So with the above and the "meet the user where they are and don't make them 
>> understand more of Cassandra than absolutely critical to use it", I lean:
>> 
>> 1. DENSE_VECTOR
>> 2. VECTOR
>> 3. type[dimension]
>> 
>> This leaves the path open for us to expand on it in the future with sparse 
>> support and allows us to introduce some semantics that indicate idioms 
>> around nullability for the users coming from a different space.
>> 
>> "NON-NULL FROZEN" is strictly correct, however it requires 
>> understanding idioms of how Cassandra thinks about data (nulls mean 
>> different things to us, we have differences between frozen and non-frozen 
>> due to constraints in our storage engine and materialization of data, etc) 
>> that get in the way of users doing things in the pattern they're familiar 
>> with without learning more about the DB than they're probably looking to 
>> learn. Historically this has been a challenge for us in adoption; the 
>> classic "Why can't I just write and delete and write as much as I want? Why 
>> are deletes filling up my disk?" problem comes to mind.
>> 
>> I'd also be happy with us supporting:
>> * NON-NULL FROZEN
>> * DENSE_VECTOR as syntactic sugar for the above
>> 
>> If getting into the "built-in syntactic sugar mapping for communities and 
>> specific use-cases" is something we're willing to consider.
>> 
>> On Fri, May 5, 2023, at 7:26 AM, Patrick McFadin wrote:
>>> I think we are still discussing implementation here when I'm talking about 
>>> developer experience. I want developers to adopt this quickly, easily and 
>>> be successful. Vector search is already a thing. People use it every day. A 
>>> successful outcome, in my view, is developers picking up this feature 
>>> without reading a manual. (Because they don't anyway and get in trouble) I 
>>> did some more extensive research about what other DBs are using for syntax. 
>>> The consensus is some variety of 'VECTOR', 'DENSE' and 'SPARSE'
>>> 
>>> Pinecone[1] - dense_vector, sparse_vector
>>> Elastic[2]: dense_vector
>>> Milvus[3]: float_vector, binary_vector
>>> pgvector[4]: vector
>>> Weaviate[5]: Different approach. All typed arrays can be indexed
>>> 
>>> Based on that I'm advocating a similar syntax:
>>> 
>>> - DENSE VECTOR
>>> or

Re: [POLL] Vector type for ML

2023-05-05 Thread Patrick McFadin
I hope we are willing to consider developers that use our system because if
I had to teach people to use "NON-NULL FROZEN" I'm pretty sure the
response would be:

Did you tell me to go write a distributed map-reduce job in Erlang? I
believe I did, Bob.

On Fri, May 5, 2023 at 8:05 AM Josh McKenzie  wrote:

> Idiomatically, to my mind, there's a question of "what space are we
> thinking about this datatype in"?
>
> - In the context of mathematics, nullability in a vector would be 0
> - In the context of Cassandra, nullability tends to mean a tombstone (or
> nothing)
> - In the context of programming languages, it's all over the place
>
> Given many models are exploring quantizing to int8 and other data types,
> there's definitely the "support other data types easily in the future"
> piece to me we need to keep in mind.
>
> So with the above and the "meet the user where they are and don't make
> them understand more of Cassandra than absolutely critical to use it", I
> lean:
>
> 1. DENSE_VECTOR
> 2. VECTOR
> 3. type[dimension]
>
> This leaves the path open for us to expand on it in the future with sparse
> support and allows us to introduce some semantics that indicate idioms
> around nullability for the users coming from a different space.
>
> "NON-NULL FROZEN" is strictly correct, however it requires
> understanding idioms of how Cassandra thinks about data (nulls mean
> different things to us, we have differences between frozen and non-frozen
> due to constraints in our storage engine and materialization of data, etc)
> that get in the way of users doing things in the pattern they're familiar
> with without learning more about the DB than they're probably looking to
> learn. Historically this has been a challenge for us in adoption; the
> classic "Why can't I just write and delete and write as much as I want? Why
> are deletes filling up my disk?" problem comes to mind.
>
> I'd also be happy with us supporting:
> * NON-NULL FROZEN
> * DENSE_VECTOR as syntactic sugar for the above
>
> If getting into the "built-in syntactic sugar mapping for communities and
> specific use-cases" is something we're willing to consider.
>
> On Fri, May 5, 2023, at 7:26 AM, Patrick McFadin wrote:
>
> I think we are still discussing implementation here when I'm talking about
> developer experience. I want developers to adopt this quickly, easily and
> be successful. Vector search is already a thing. People use it every day. A
> successful outcome, in my view, is developers picking up this feature
> without reading a manual. (Because they don't anyway and get in trouble) I
> did some more extensive research about what other DBs are using for syntax.
> The consensus is some variety of 'VECTOR', 'DENSE' and 'SPARSE'
>
> Pinecone[1] - dense_vector, sparse_vector
> Elastic[2]: dense_vector
> Milvus[3]: float_vector, binary_vector
> pgvector[4]: vector
> Weaviate[5]: Different approach. All typed arrays can be indexed
>
> Based on that I'm advocating a similar syntax:
>
> - DENSE VECTOR
> or
> - VECTOR
>
> [1] https://docs.pinecone.io/docs/hybrid-search
> [2]
> https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html
> [3] https://milvus.io/docs/create_collection.md
> [4] https://github.com/pgvector/pgvector
> [5] https://weaviate.io/developers/weaviate/config-refs/datatypes
>
> On Fri, May 5, 2023 at 6:07 AM Mike Adamson  wrote:
>
> Then we can have the indexing apparatus only accept *frozen* for
> the HNSW case.
>
> I'm inclined to agree with Benedict that the index will need to be
> specifically selected by option rather than inferred based on type. As such
> there is no real reason for the *frozen* requirement on the type. The
> hnsw index can be built just as easily from a non-frozen array.
>
> I am in favour of enforcing non-null on the elements of an array by
> default. I would prefer that allowing nulls in the array would be a later
> addition if and when a use case arose for it.
>
> On Fri, 5 May 2023 at 03:02, Caleb Rackliffe 
> wrote:
>
> Even in the ML case, sparse can just mean zeros rather than nulls, and
> they should compress similarly anyway.
>
> If we really want null values, I'd rather leave that in collections space.
>
> On Thu, May 4, 2023 at 8:59 PM Caleb Rackliffe 
> wrote:
>
> I actually still prefer *type[dimension]*, because I think I intuitively
> read this as a primitive (meaning no null elements) array. Then we can have
> the indexing apparatus only accept *frozen* for the HNSW case.
>
> If that isn't intuitive to anyone else, I don't really have a strong
> opinion...but...conflating "frozen" and "dense" seems like a bad idea. One
> should indicate single vs. multi-cell, and the other the presence or
> absence of nulls/zeros/whatever.
>
> On Thu, May 4, 2023 at 12:51 PM Patrick McFadin 
> wrote:
>
> I agree with David's reasoning and the use of DENSE (and maybe eventually
> SPARSE). This is terminology well established in the data world, and it
> would lead to much easier adoption 

Re: [POLL] Vector type for ML

2023-05-05 Thread Josh McKenzie
Idiomatically, to my mind, there's a question of "what space are we thinking 
about this datatype in"?

- In the context of mathematics, nullability in a vector would be 0
- In the context of Cassandra, nullability tends to mean a tombstone (or 
nothing)
- In the context of programming languages, it's all over the place

Given many models are exploring quantizing to int8 and other data types, 
there's definitely a "support other data types easily in the future" piece 
that, to me, we need to keep in mind.

So with the above and the "meet the user where they are and don't make them 
understand more of Cassandra than absolutely critical to use it", I lean:

1. DENSE_VECTOR
2. VECTOR
3. type[dimension]

This leaves the path open for us to expand on it in the future with sparse 
support and allows us to introduce some semantics that indicate idioms around 
nullability for the users coming from a different space.

"NON-NULL FROZEN" is strictly correct, however it requires 
understanding idioms of how Cassandra thinks about data (nulls mean different 
things to us, we have differences between frozen and non-frozen due to 
constraints in our storage engine and materialization of data, etc) that get in 
the way of users doing things in the pattern they're familiar with without 
learning more about the DB than they're probably looking to learn. Historically 
this has been a challenge for us in adoption; the classic "Why can't I just 
write and delete and write as much as I want? Why are deletes filling up my 
disk?" problem comes to mind.

I'd also be happy with us supporting:
* NON-NULL FROZEN
* DENSE_VECTOR as syntactic sugar for the above

If getting into the "built-in syntactic sugar mapping for communities and 
specific use-cases" is something we're willing to consider.

On Fri, May 5, 2023, at 7:26 AM, Patrick McFadin wrote:
> I think we are still discussing implementation here when I'm talking about 
> developer experience. I want developers to adopt this quickly, easily and be 
> successful. Vector search is already a thing. People use it every day. A 
> successful outcome, in my view, is developers picking up this feature without 
> reading a manual. (Because they don't anyway and get in trouble) I did some 
> more extensive research about what other DBs are using for syntax. The 
> consensus is some variety of 'VECTOR', 'DENSE' and 'SPARSE'
> 
> Pinecone[1] - dense_vector, sparse_vector
> Elastic[2]: dense_vector
> Milvus[3]: float_vector, binary_vector
> pgvector[4]: vector
> Weaviate[5]: Different approach. All typed arrays can be indexed
> 
> Based on that I'm advocating a similar syntax:
> 
> - DENSE VECTOR
> or
> - VECTOR
> 
> [1] https://docs.pinecone.io/docs/hybrid-search
> [2] 
> https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html
> [3] https://milvus.io/docs/create_collection.md
> [4] https://github.com/pgvector/pgvector
> [5] https://weaviate.io/developers/weaviate/config-refs/datatypes
> 
> On Fri, May 5, 2023 at 6:07 AM Mike Adamson  wrote:
>>> Then we can have the indexing apparatus only accept *frozen* for 
>>> the HNSW case.
>> I'm inclined to agree with Benedict that the index will need to be 
>> specifically selected by option rather than inferred based on type. As such 
>> there is no real reason for the *frozen* requirement on the type. The hnsw 
>> index can be built just as easily from a non-frozen array.
>> 
>> I am in favour of enforcing non-null on the elements of an array by default. 
>> I would prefer that allowing nulls in the array would be a later addition if 
>> and when a use case arose for it.
>> 
>> On Fri, 5 May 2023 at 03:02, Caleb Rackliffe  
>> wrote:
>>> Even in the ML case, sparse can just mean zeros rather than nulls, and they 
>>> should compress similarly anyway.
>>> 
>>> If we really want null values, I'd rather leave that in collections space.
>>> 
>>> On Thu, May 4, 2023 at 8:59 PM Caleb Rackliffe  
>>> wrote:
>>>> I actually still prefer *type[dimension]*, because I think I intuitively 
>>>> read this as a primitive (meaning no null elements) array. Then we can 
>>>> have the indexing apparatus only accept *frozen* for the HNSW 
>>>> case.
>>>>
>>>> If that isn't intuitive to anyone else, I don't really have a strong 
>>>> opinion...but...conflating "frozen" and "dense" seems like a bad idea. One 
>>>> should indicate single vs. multi-cell, and the other the presence or 
>>>> absence of nulls/zeros/whatever.
>>>>
>>>> On Thu, May 4, 2023 at 12:51 PM Patrick McFadin wrote:
>>>>> I agree with David's reasoning and the use of DENSE (and maybe eventually 
>>>>> SPARSE). This is terminology well established in the data world, and it 
>>>>> would lead to much easier adoption from users. VECTOR is close, but I can 
>>>>> see having to create a lot of content around "How to use it and not get 
>>>>> in trouble." (I have a lot of that content already)
>>>>>
>>>>>  - We don't have to explain what it is. A lot of prior art out there 
>>>>>

Re: [POLL] Vector type for ML

2023-05-05 Thread Patrick McFadin
I think we are still discussing implementation here when I'm talking about
developer experience. I want developers to adopt this quickly and easily, and
to be successful. Vector search is already a thing. People use it every day. A
successful outcome, in my view, is developers picking up this feature
without reading a manual (because they don't anyway, and then get in trouble). I
did some more extensive research about what other DBs are using for syntax.
The consensus is some variety of 'VECTOR', 'DENSE' and 'SPARSE'.

Pinecone[1] - dense_vector, sparse_vector
Elastic[2]: dense_vector
Milvus[3]: float_vector, binary_vector
pgvector[4]: vector
Weaviate[5]: Different approach. All typed arrays can be indexed

Based on that I'm advocating a similar syntax:

- DENSE VECTOR
or
- VECTOR

[1] https://docs.pinecone.io/docs/hybrid-search
[2]
https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html
[3] https://milvus.io/docs/create_collection.md
[4] https://github.com/pgvector/pgvector
[5] https://weaviate.io/developers/weaviate/config-refs/datatypes

On Fri, May 5, 2023 at 6:07 AM Mike Adamson  wrote:

>> Then we can have the indexing apparatus only accept *frozen* for
>> the HNSW case.
>>
> I'm inclined to agree with Benedict that the index will need to be
> specifically selected by option rather than inferred based on type. As such
> there is no real reason for the *frozen* requirement on the type. The
> hnsw index can be built just as easily from a non-frozen array.
>
> I am in favour of enforcing non-null on the elements of an array by
> default. I would prefer that allowing nulls in the array would be a later
> addition if and when a use case arose for it.
>
> On Fri, 5 May 2023 at 03:02, Caleb Rackliffe 
> wrote:
>
>> Even in the ML case, sparse can just mean zeros rather than nulls, and
>> they should compress similarly anyway.
>>
>> If we really want null values, I'd rather leave that in collections space.
>>
>> On Thu, May 4, 2023 at 8:59 PM Caleb Rackliffe 
>> wrote:
>>
>>> I actually still prefer *type[dimension]*, because I think I
>>> intuitively read this as a primitive (meaning no null elements) array. Then
>>> we can have the indexing apparatus only accept *frozen* for
>>> the HNSW case.
>>>
>>> If that isn't intuitive to anyone else, I don't really have a strong
>>> opinion...but...conflating "frozen" and "dense" seems like a bad idea. One
>>> should indicate single vs. multi-cell, and the other the presence or
>>> absence of nulls/zeros/whatever.
>>>
>>> On Thu, May 4, 2023 at 12:51 PM Patrick McFadin 
>>> wrote:
>>>
>>>> I agree with David's reasoning and the use of DENSE (and maybe
>>>> eventually SPARSE). This is terminology well established in the data world,
>>>> and it would lead to much easier adoption from users. VECTOR is close, but
>>>> I can see having to create a lot of content around "How to use it and not
>>>> get in trouble." (I have a lot of that content already)
>>>>
>>>>  - We don't have to explain what it is. A lot of prior art out there
>>>> already [1][2][3]
>>>>  - We're matching an established term with what users would expect. No
>>>> surprises.
>>>>  - Shorter ramp-up time for users. Cassandra is being modernized.
>>>>
>>>> The implementation is flexible, but the interface should empower our
>>>> users to be awesome.
>>>>
>>>> Patrick
>>>>
>>>> 1 -
>>>> https://stats.stackexchange.com/questions/266996/what-do-the-terms-dense-and-sparse-mean-in-the-context-of-neural-networks
>>>> 2 -
>>>> https://induraj2020.medium.com/what-are-sparse-features-and-dense-features-8d1746a77035
>>>> 3 -
>>>> https://revware.net/sparse-vs-dense-data-the-power-of-points-and-clouds/
>>>>
>>>> On Thu, May 4, 2023 at 10:25 AM David Capwell 
>>>> wrote:
>>>>
>>>>> My views have changed over time on syntax and I feel type[dimension]
>>>>> may not be the best, so it has gone lower in my own personal ranking… this
>>>>> is my current preference
>>>>>
>>>>> 1) DENSE [dimension] | NON NULL [dimension]
>>>>> 2) VECTOR
>>>>> 3) type[dimension]
>>>>>
>>>>> My reasoning for this order
>>>>>
>>>>> * type[dimension] looks like syntax sugar for array,
>>>>> so users may assume list/array semantics, but we limit to non-null
>>>>> elements in a frozen array
>>>>> * feel VECTOR as a prefix feels out of 

Re: [POLL] Vector type for ML

2023-05-05 Thread Mike Adamson
>
> Then we can have the indexing apparatus only accept *frozen* for
> the HNSW case.
>
I'm inclined to agree with Benedict that the index will need to be
specifically selected by option rather than inferred based on type. As such
there is no real reason for the *frozen* requirement on the type. The hnsw
index can be built just as easily from a non-frozen array.

I am in favour of enforcing non-null on the elements of an array by
default. I would prefer that allowing nulls in the array would be a later
addition if and when a use case arose for it.

On Fri, 5 May 2023 at 03:02, Caleb Rackliffe 
wrote:

> Even in the ML case, sparse can just mean zeros rather than nulls, and
> they should compress similarly anyway.
>
> If we really want null values, I'd rather leave that in collections space.
>
> On Thu, May 4, 2023 at 8:59 PM Caleb Rackliffe 
> wrote:
>
>> I actually still prefer *type[dimension]*, because I think I intuitively
>> read this as a primitive (meaning no null elements) array. Then we can have
>> the indexing apparatus only accept *frozen* for the HNSW case.
>>
>> If that isn't intuitive to anyone else, I don't really have a strong
>> opinion...but...conflating "frozen" and "dense" seems like a bad idea. One
>> should indicate single vs. multi-cell, and the other the presence or
>> absence of nulls/zeros/whatever.
>>
>> On Thu, May 4, 2023 at 12:51 PM Patrick McFadin 
>> wrote:
>>
>>> I agree with David's reasoning and the use of DENSE (and maybe
>>> eventually SPARSE). This is terminology well established in the data world,
>>> and it would lead to much easier adoption from users. VECTOR is close, but
>>> I can see having to create a lot of content around "How to use it and not
>>> get in trouble." (I have a lot of that content already)
>>>
>>>  - We don't have to explain what it is. A lot of prior art out there
>>> already [1][2][3]
>>>  - We're matching an established term with what users would expect. No
>>> surprises.
>>>  - Shorter ramp-up time for users. Cassandra is being modernized.
>>>
>>> The implementation is flexible, but the interface should empower our
>>> users to be awesome.
>>>
>>> Patrick
>>>
>>> 1 -
>>> https://stats.stackexchange.com/questions/266996/what-do-the-terms-dense-and-sparse-mean-in-the-context-of-neural-networks
>>> 
>>> 2 -
>>> https://induraj2020.medium.com/what-are-sparse-features-and-dense-features-8d1746a77035
>>> 
>>> 3 -
>>> https://revware.net/sparse-vs-dense-data-the-power-of-points-and-clouds/
>>> 
>>>
>>> On Thu, May 4, 2023 at 10:25 AM David Capwell 
>>> wrote:
>>>
>>>> My views have changed over time on syntax and I feel type[dimension]
>>>> may not be the best, so it has gone lower in my own personal ranking… this
>>>> is my current preference
>>>>
>>>> 1) DENSE [dimension] | NON NULL [dimension]
>>>> 2) VECTOR
>>>> 3) type[dimension]
>>>>
>>>> My reasoning for this order
>>>>
>>>> * type[dimension] looks like syntax sugar for array,
>>>> so users may assume list/array semantics, but we limit to non-null elements
>>>> in a frozen array
>>>> * feel VECTOR as a prefix feels out of place, but VECTOR as a direct
>>>> type makes more sense… this also leads to a possible future of VECTOR
>>>> which is the non-fixed length version of this type.  What makes VECTOR
>>>> different from list/array?  non-null elements and is frozen.  I don’t feel
>>>> that VECTOR really tells users to expect non-null or frozen semantics, as
>>>> there exists different VECTOR types for those reasons (sparse vs dense)…
>>>> * DENSE may be confusing for people coming from languages where this
>>>> just means “sequential layout”, which is what our frozen array/list already
>>>> are… but since the target user is coming from a ML background, this
>>>> shouldn’t offer much confusion.  DENSE just means FROZEN in Cassandra, with
>>>> NON NULL elements (SPARSE allows for NULL and isn’t frozen)… So DENSE just
>>>> acts as syntax sugar for frozen
>>>>
>>>>
>>>> On May 4, 2023, at 4:13 AM, Brandon Williams  wrote:
>>>>
>>>> 1. VECTOR
>>>> 2. VECTOR FLOAT[n]
>>>> 3. FLOAT[N]   (Non null by default)
>>>>
>>>> Redundant or not, I think having the VECTOR keyword helps signify what
>>>> the app is generally about and helps get buy-in from ML stakeholders.
>>>>
>>>> On Thu, May 4, 2023 at 3:45 AM Benedict  wrote:


Re: [POLL] Vector type for ML

2023-05-04 Thread Caleb Rackliffe
Even in the ML case, sparse can just mean zeros rather than nulls, and they
should compress similarly anyway.

If we really want null values, I'd rather leave that in collections space.
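
A rough way to see the compression point above, as a Python sketch. This is
not Cassandra code and does not model Cassandra's storage format: the 32-bit
float packing, the use of zlib as a stand-in compressor, the 1% fill rate, and
the helper names pack_floats and compressed_size are all assumptions made for
illustration.

```python
import random
import struct
import zlib


def pack_floats(values):
    """Pack values as big-endian 32-bit floats (an assumed encoding)."""
    return struct.pack(f">{len(values)}f", *values)


def compressed_size(values):
    """Size in bytes of the zlib-compressed packed vector."""
    return len(zlib.compress(pack_floats(values)))


rng = random.Random(0)  # fixed seed so runs are repeatable
dimension = 10_000

# A "sparse" vector stored densely: about 1% of the slots are non-zero.
mostly_zeros = [0.0] * dimension
for index in rng.sample(range(dimension), dimension // 100):
    mostly_zeros[index] = rng.random()

# For contrast, a vector where every slot holds a value.
fully_populated = [rng.random() for _ in range(dimension)]

for label, vector in (("mostly zeros", mostly_zeros),
                      ("fully populated", fully_populated)):
    print(f"{label}: raw={len(pack_floats(vector))} "
          f"compressed={compressed_size(vector)}")
```

Both vectors occupy the same number of bytes before compression; the long
runs of zeros are what a general-purpose compressor can shrink, which is why
storing "absent" slots as zeros need not be expensive on disk.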

On Thu, May 4, 2023 at 8:59 PM Caleb Rackliffe 
wrote:

> I actually still prefer *type[dimension]*, because I think I intuitively
> read this as a primitive (meaning no null elements) array. Then we can have
> the indexing apparatus only accept *frozen* for the HNSW case.
>
> If that isn't intuitive to anyone else, I don't really have a strong
> opinion...but...conflating "frozen" and "dense" seems like a bad idea. One
> should indicate single vs. multi-cell, and the other the presence or
> absence of nulls/zeros/whatever.
>
> On Thu, May 4, 2023 at 12:51 PM Patrick McFadin 
> wrote:
>
>> I agree with David's reasoning and the use of DENSE (and maybe eventually
>> SPARSE). This is terminology well established in the data world, and it
>> would lead to much easier adoption from users. VECTOR is close, but I can
>> see having to create a lot of content around "How to use it and not get in
>> trouble." (I have a lot of that content already)
>>
>>  - We don't have to explain what it is. A lot of prior art out there
>> already [1][2][3]
>>  - We're matching an established term with what users would expect. No
>> surprises.
>>  - Shorter ramp-up time for users. Cassandra is being modernized.
>>
>> The implementation is flexible, but the interface should empower our
>> users to be awesome.
>>
>> Patrick
>>
>> 1 -
>> https://stats.stackexchange.com/questions/266996/what-do-the-terms-dense-and-sparse-mean-in-the-context-of-neural-networks
>> 2 -
>> https://induraj2020.medium.com/what-are-sparse-features-and-dense-features-8d1746a77035
>> 3 -
>> https://revware.net/sparse-vs-dense-data-the-power-of-points-and-clouds/
>>
>> On Thu, May 4, 2023 at 10:25 AM David Capwell  wrote:
>>
>>> My views have changed over time on syntax and I feel type[dimension] may
>>> not be the best, so it has gone lower in my own personal ranking… this is
>>> my current preference
>>>
>>> 1) DENSE [dimension] | NON NULL [dimension]
>>> 2) VECTOR
>>> 3) type[dimension]
>>>
>>> My reasoning for this order
>>>
>>> * type[dimension] looks like syntax sugar for array, so
>>> users may assume list/array semantics, but we limit to non-null elements in
>>> a frozen array
>>> * feel VECTOR as a prefix feels out of place, but VECTOR as a direct
>>> type makes more sense… this also leads to a possible future of VECTOR
>>> which is the non-fixed length version of this type.  What makes VECTOR
>>> different from list/array?  non-null elements and is frozen.  I don’t feel
>>> that VECTOR really tells users to expect non-null or frozen semantics, as
>>> there exists different VECTOR types for those reasons (sparse vs dense)…
>>> * DENSE may be confusing for people coming from languages where this
>>> just means “sequential layout”, which is what our frozen array/list already
>>> are… but since the target user is coming from a ML background, this
>>> shouldn’t offer much confusion.  DENSE just means FROZEN in Cassandra, with
>>> NON NULL elements (SPARSE allows for NULL and isn’t frozen)… So DENSE just
>>> acts as syntax sugar for frozen
>>>
>>>
>>> On May 4, 2023, at 4:13 AM, Brandon Williams  wrote:
>>>
>>> 1. VECTOR
>>> 2. VECTOR FLOAT[n]
>>> 3. FLOAT[N]   (Non null by default)
>>>
>>> Redundant or not, I think having the VECTOR keyword helps signify what
>>> the app is generally about and helps get buy-in from ML stakeholders.
>>>
>>> On Thu, May 4, 2023 at 3:45 AM Benedict  wrote:
>>>
>>>
>>> Hurrah for initial agreement.
>>>
>>> For syntax, I think one option was just FLOAT[N]. In VECTOR FLOAT[N],
>>> VECTOR is redundant - FLOAT[N] is fully descriptive by itself. I don’t
>>> think VECTOR should be used to simply imply non-null, as this would be very
>>> unintuitive. More logical would be NONNULL, if this is the only condition
>>> being applied. Alternatively for arrays we could default to NONNULL and
>>> later introduce NULLABLE if we want to permit nulls.
>>>
>>> If the word vector is to be used it makes more sense to make it look
>>> like a list, so VECTOR as here the word VECTOR is clearly not
>>> redundant.
>>>
>>> So, I vote:
>>>
>>> 1) (NON NULL) FLOAT[N]
>>> 2) FLOAT[N]   (Non null by default)
>>> 3) VECTOR
>>>
>>>
>>>
>>> On 4 May 2023, at 08:52, Mick Semb Wever  wrote:
>>>
>>> 
>>>
>>>
>>> Did we agree on a CQL syntax?
>>>
>>> I don’t believe there has been a poll on CQL syntax… my understanding
>>> reading all the threads is that there are ~4-5 options and none are -1ed, so
>>> believe we are waiting for majority rule on this?
>>>
>>>
>>>
>>>
>>> Re-reading that thread, IIUC the valid choices remaining are…
>>>
>>> 1. VECTOR FLOAT[n]
>>> 2. FLOAT VECTOR[n]
>>> 3. VECTOR
>>> 4. VECTOR[n]
>>> 5. ARRAY
>>> 6. NON-NULL FROZEN
>>>
>>>
>>> Yes I'm putting my preference (1) first ;) because (banging on) if the
>>> future of CQL will have FLOAT[n] and FROZEN, 

Re: [POLL] Vector type for ML

2023-05-04 Thread Caleb Rackliffe
I actually still prefer *type[dimension]*, because I think I intuitively
read this as a primitive (meaning no null elements) array. Then we can have
the indexing apparatus only accept *frozen* for the HNSW case.

If that isn't intuitive to anyone else, I don't really have a strong
opinion...but...conflating "frozen" and "dense" seems like a bad idea. One
should indicate single vs. multi-cell, and the other the presence or
absence of nulls/zeros/whatever.
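
To make the "two separate properties" point concrete, here is a small Python
sketch. It is only a model for discussion, not proposed syntax or Cassandra
code; the ArraySpec class and its field names are hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ArraySpec:
    """Hypothetical description of a fixed-size array type."""

    element_type: str
    dimension: int
    frozen: bool    # single cell (True) vs. multi-cell (False)
    nullable: bool  # whether individual elements may be null

    def describe(self) -> str:
        cells = "single-cell" if self.frozen else "multi-cell"
        nulls = "nullable" if self.nullable else "non-null"
        return f"{self.element_type}[{self.dimension}]: {cells}, {nulls}"


# The two flags vary independently, giving four distinct combinations.
for frozen, nullable in product((True, False), repeat=2):
    print(ArraySpec("float", 3, frozen, nullable).describe())
```

A single keyword that implies both flags at once can only name one of these
four combinations, which is the conflation being objected to.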

On Thu, May 4, 2023 at 12:51 PM Patrick McFadin  wrote:

> I agree with David's reasoning and the use of DENSE (and maybe eventually
> SPARSE). This is terminology well established in the data world, and it
> would lead to much easier adoption from users. VECTOR is close, but I can
> see having to create a lot of content around "How to use it and not get in
> trouble." (I have a lot of that content already)
>
>  - We don't have to explain what it is. A lot of prior art out there
> already [1][2][3]
>  - We're matching an established term with what users would expect. No
> surprises.
>  - Shorter ramp-up time for users. Cassandra is being modernized.
>
> The implementation is flexible, but the interface should empower our users
> to be awesome.
>
> Patrick
>
> 1 -
> https://stats.stackexchange.com/questions/266996/what-do-the-terms-dense-and-sparse-mean-in-the-context-of-neural-networks
> 2 -
> https://induraj2020.medium.com/what-are-sparse-features-and-dense-features-8d1746a77035
> 3 -
> https://revware.net/sparse-vs-dense-data-the-power-of-points-and-clouds/
>
> On Thu, May 4, 2023 at 10:25 AM David Capwell  wrote:
>
>> My views have changed over time on syntax and I feel type[dimension] may
>> not be the best, so it has gone lower in my own personal ranking… this is
>> my current preference
>>
>> 1) DENSE [dimension] | NON NULL [dimension]
>> 2) VECTOR
>> 3) type[dimension]
>>
>> My reasoning for this order
>>
>> * type[dimension] looks like syntax sugar for array, so
>> users may assume list/array semantics, but we limit to non-null elements in
>> a frozen array
>> * feel VECTOR as a prefix feels out of place, but VECTOR as a direct type
>> makes more sense… this also leads to a possible future of VECTOR
>> which is the non-fixed length version of this type.  What makes VECTOR
>> different from list/array?  non-null elements and is frozen.  I don’t feel
>> that VECTOR really tells users to expect non-null or frozen semantics, as
>> there exists different VECTOR types for those reasons (sparse vs dense)…
>> * DENSE may be confusing for people coming from languages where this just
>> means “sequential layout”, which is what our frozen array/list already are…
>> but since the target user is coming from a ML background, this shouldn’t
>> offer much confusion.  DENSE just means FROZEN in Cassandra, with NON NULL
>> elements (SPARSE allows for NULL and isn’t frozen)… So DENSE just acts as
>> syntax sugar for frozen
>>
>>
>> On May 4, 2023, at 4:13 AM, Brandon Williams  wrote:
>>
>> 1. VECTOR
>> 2. VECTOR FLOAT[n]
>> 3. FLOAT[N]   (Non null by default)
>>
>> Redundant or not, I think having the VECTOR keyword helps signify what
>> the app is generally about and helps get buy-in from ML stakeholders.
>>
>> On Thu, May 4, 2023 at 3:45 AM Benedict  wrote:
>>
>>
>> Hurrah for initial agreement.
>>
>> For syntax, I think one option was just FLOAT[N]. In VECTOR FLOAT[N],
>> VECTOR is redundant - FLOAT[N] is fully descriptive by itself. I don’t
>> think VECTOR should be used to simply imply non-null, as this would be very
>> unintuitive. More logical would be NONNULL, if this is the only condition
>> being applied. Alternatively for arrays we could default to NONNULL and
>> later introduce NULLABLE if we want to permit nulls.
>>
>> If the word vector is to be used it makes more sense to make it look like
>> a list, so VECTOR as here the word VECTOR is clearly not
>> redundant.
>>
>> So, I vote:
>>
>> 1) (NON NULL) FLOAT[N]
>> 2) FLOAT[N]   (Non null by default)
>> 3) VECTOR
>>
>>
>>
>> On 4 May 2023, at 08:52, Mick Semb Wever  wrote:
>>
>> 
>>
>>
>> Did we agree on a CQL syntax?
>>
>> I don’t believe there has been a poll on CQL syntax… my understanding
>> reading all the threads is that there are ~4-5 options and none are -1ed, so
>> believe we are waiting for majority rule on this?
>>
>>
>>
>>
>> Re-reading that thread, IIUC the valid choices remaining are…
>>
>> 1. VECTOR FLOAT[n]
>> 2. FLOAT VECTOR[n]
>> 3. VECTOR
>> 4. VECTOR[n]
>> 5. ARRAY
>> 6. NON-NULL FROZEN
>>
>>
>> Yes I'm putting my preference (1) first ;) because (banging on) if the
>> future of CQL will have FLOAT[n] and FROZEN, where the VECTOR
>> keyword is: for general cql users; just meaning "non-null and frozen",
>> these gel best together.
>>
>> Options (5) and (6) are for those that feel we can and should provide
>> this type without introducing the vector keyword.
>>
>>
>>
>>


Re: [POLL] Vector type for ML

2023-05-04 Thread Patrick McFadin
I agree with David's reasoning and the use of DENSE (and maybe eventually
SPARSE). This is terminology well established in the data world, and it
would lead to much easier adoption from users. VECTOR is close, but I can
see having to create a lot of content around "How to use it and not get in
trouble." (I have a lot of that content already)

 - We don't have to explain what it is. A lot of prior art out there
already [1][2][3]
 - We're matching an established term with what users would expect. No
surprises.
 - Shorter ramp-up time for users. Cassandra is being modernized.

The implementation is flexible, but the interface should empower our users
to be awesome.

Patrick

1 -
https://stats.stackexchange.com/questions/266996/what-do-the-terms-dense-and-sparse-mean-in-the-context-of-neural-networks
2 -
https://induraj2020.medium.com/what-are-sparse-features-and-dense-features-8d1746a77035
3 - https://revware.net/sparse-vs-dense-data-the-power-of-points-and-clouds/

On Thu, May 4, 2023 at 10:25 AM David Capwell  wrote:

> My views have changed over time on syntax and I feel type[dimension] may
> not be the best, so it has gone lower in my own personal ranking… this is
> my current preference
>
> 1) DENSE [dimension] | NON NULL [dimension]
> 2) VECTOR
> 3) type[dimension]
>
> My reasoning for this order
>
> * type[dimension] looks like syntax sugar for array, so
> users may assume list/array semantics, but we limit to non-null elements in
> a frozen array
> * feel VECTOR as a prefix feels out of place, but VECTOR as a direct type
> makes more sense… this also leads to a possible future of VECTOR
> which is the non-fixed length version of this type.  What makes VECTOR
> different from list/array?  non-null elements and is frozen.  I don’t feel
> that VECTOR really tells users to expect non-null or frozen semantics, as
> there exists different VECTOR types for those reasons (sparse vs dense)…
> * DENSE may be confusing for people coming from languages where this just
> means “sequential layout”, which is what our frozen array/list already are…
> but since the target user is coming from a ML background, this shouldn’t
> offer much confusion.  DENSE just means FROZEN in Cassandra, with NON NULL
> elements (SPARSE allows for NULL and isn’t frozen)… So DENSE just acts as
> syntax sugar for frozen
>
>
> On May 4, 2023, at 4:13 AM, Brandon Williams  wrote:
>
> 1. VECTOR
> 2. VECTOR FLOAT[n]
> 3. FLOAT[N]   (Non null by default)
>
> Redundant or not, I think having the VECTOR keyword helps signify what
> the app is generally about and helps get buy-in from ML stakeholders.
>
> On Thu, May 4, 2023 at 3:45 AM Benedict  wrote:
>
>
> Hurrah for initial agreement.
>
> For syntax, I think one option was just FLOAT[N]. In VECTOR FLOAT[N],
> VECTOR is redundant - FLOAT[N] is fully descriptive by itself. I don’t
> think VECTOR should be used to simply imply non-null, as this would be very
> unintuitive. More logical would be NONNULL, if this is the only condition
> being applied. Alternatively for arrays we could default to NONNULL and
> later introduce NULLABLE if we want to permit nulls.
>
> If the word vector is to be used it makes more sense to make it look like
> a list, so VECTOR as here the word VECTOR is clearly not
> redundant.
>
> So, I vote:
>
> 1) (NON NULL) FLOAT[N]
> 2) FLOAT[N]   (Non null by default)
> 3) VECTOR
>
>
>
> On 4 May 2023, at 08:52, Mick Semb Wever  wrote:
>
> 
>
>
> Did we agree on a CQL syntax?
>
> I don’t believe there has been a poll on CQL syntax… my understanding
> reading all the threads is that there are ~4-5 options and none are -1ed, so
> believe we are waiting for majority rule on this?
>
>
>
>
> Re-reading that thread, IIUC the valid choices remaining are…
>
> 1. VECTOR FLOAT[n]
> 2. FLOAT VECTOR[n]
> 3. VECTOR
> 4. VECTOR[n]
> 5. ARRAY
> 6. NON-NULL FROZEN
>
>
> Yes I'm putting my preference (1) first ;) because (banging on) if the
> future of CQL will have FLOAT[n] and FROZEN, where the VECTOR
> keyword is: for general cql users; just meaning "non-null and frozen",
> these gel best together.
>
> Options (5) and (6) are for those that feel we can and should provide this
> type without introducing the vector keyword.
>
>
>
>


Re: [POLL] Vector type for ML

2023-05-04 Thread David Capwell
My views have changed over time on syntax and I feel type[dimension] may not be
the best, so it has gone lower in my own personal ranking… this is my current
preference

1) DENSE [dimension] | NON NULL [dimension]
2) VECTOR
3) type[dimension]

My reasoning for this order

* type[dimension] looks like syntax sugar for array, so users
may assume list/array semantics, but we limit to non-null elements in a frozen
array
* VECTOR as a prefix feels out of place, but VECTOR as a direct type makes
more sense… this also leads to a possible future of VECTOR which is the
non-fixed length version of this type.  What makes VECTOR different from
list/array?  non-null elements and is frozen.  I don’t feel that VECTOR really
tells users to expect non-null or frozen semantics, as there exist different
VECTOR types for those reasons (sparse vs dense)…
* DENSE may be confusing for people coming from languages where this just means
“sequential layout”, which is what our frozen array/list already are… but since
the target user is coming from an ML background, this shouldn’t offer much
confusion.  DENSE just means FROZEN in Cassandra, with NON NULL elements
(SPARSE allows for NULL and isn’t frozen)… So DENSE just acts as syntax sugar
for frozen
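
As a rough illustration of "DENSE means frozen with non-null elements", here
is a Python sketch. It is a model of the semantics only, not Cassandra code;
the DenseVector class is hypothetical, and Python's frozen dataclass is used
merely as an analogy for a value that is replaced as a whole.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Tuple


@dataclass(frozen=True)
class DenseVector:
    """Hypothetical model: fixed dimension, no nulls, replaced as a whole."""

    dimension: int
    values: Tuple[float, ...]

    def __post_init__(self):
        if len(self.values) != self.dimension:
            raise ValueError("length must equal the declared dimension")
        if any(value is None for value in self.values):
            raise ValueError("dense vectors do not allow null elements")


embedding = DenseVector(dimension=3, values=(0.1, 0.2, 0.3))

try:
    embedding.values = (9.0, 9.0, 9.0)  # in-place update is rejected
except FrozenInstanceError:
    # "Frozen" in this model: the value is swapped out as a single unit.
    embedding = DenseVector(dimension=3, values=(9.0, 9.0, 9.0))
```

Constructing DenseVector(dimension=2, values=(1.0, None)) raises ValueError in
this model, which is the non-null half of the definition.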


> On May 4, 2023, at 4:13 AM, Brandon Williams  wrote:
> 
> 1. VECTOR
> 2. VECTOR FLOAT[n]
> 3. FLOAT[N]   (Non null by default)
> 
> Redundant or not, I think having the VECTOR keyword helps signify what
> the app is generally about and helps get buy-in from ML stakeholders.
> 
> On Thu, May 4, 2023 at 3:45 AM Benedict  wrote:
>> 
>> Hurrah for initial agreement.
>> 
>> For syntax, I think one option was just FLOAT[N]. In VECTOR FLOAT[N], VECTOR 
>> is redundant - FLOAT[N] is fully descriptive by itself. I don’t think VECTOR 
>> should be used to simply imply non-null, as this would be very unintuitive. 
>> More logical would be NONNULL, if this is the only condition being applied. 
>> Alternatively for arrays we could default to NONNULL and later introduce 
>> NULLABLE if we want to permit nulls.
>> 
>> If the word vector is to be used it makes more sense to make it look like a 
>> list, so VECTOR as here the word VECTOR is clearly not redundant.
>> 
>> So, I vote:
>> 
>> 1) (NON NULL) FLOAT[N]
>> 2) FLOAT[N]   (Non null by default)
>> 3) VECTOR
>> 
>> 
>> 
>> On 4 May 2023, at 08:52, Mick Semb Wever  wrote:
>> 
>> 
>>> 
>>> Did we agree on a CQL syntax?
>>> 
>>> I don’t believe there has been a poll on CQL syntax… my understanding
>>> reading all the threads is that there are ~4-5 options and none are -1ed, so
>>> believe we are waiting for majority rule on this?
>> 
>> 
>> 
>> Re-reading that thread, IIUC the valid choices remaining are…
>> 
>> 1. VECTOR FLOAT[n]
>> 2. FLOAT VECTOR[n]
>> 3. VECTOR
>> 4. VECTOR[n]
>> 5. ARRAY
>> 6. NON-NULL FROZEN
>> 
>> 
>> Yes I'm putting my preference (1) first ;) because (banging on) if the 
>> future of CQL will have FLOAT[n] and FROZEN, where the VECTOR 
>> keyword is: for general cql users; just meaning "non-null and frozen", these 
>> gel best together.
>> 
>> Options (5) and (6) are for those that feel we can and should provide this 
>> type without introducing the vector keyword.
>> 
>> 



Re: [POLL] Vector type for ML

2023-05-04 Thread Brandon Williams
1. VECTOR
2. VECTOR FLOAT[n]
3. FLOAT[N]   (Non null by default)

Redundant or not, I think having the VECTOR keyword helps signify what
the app is generally about and helps get buy-in from ML stakeholders.

On Thu, May 4, 2023 at 3:45 AM Benedict  wrote:
>
> Hurrah for initial agreement.
>
> For syntax, I think one option was just FLOAT[N]. In VECTOR FLOAT[N], VECTOR 
> is redundant - FLOAT[N] is fully descriptive by itself. I don’t think VECTOR 
> should be used to simply imply non-null, as this would be very unintuitive. 
> More logical would be NONNULL, if this is the only condition being applied. 
> Alternatively for arrays we could default to NONNULL and later introduce 
> NULLABLE if we want to permit nulls.
>
> If the word vector is to be used it makes more sense to make it look like a 
> list, so VECTOR as here the word VECTOR is clearly not redundant.
>
> So, I vote:
>
> 1) (NON NULL) FLOAT[N]
> 2) FLOAT[N]   (Non null by default)
> 3) VECTOR
>
>
>
> On 4 May 2023, at 08:52, Mick Semb Wever  wrote:
>
> 
>>
>> Did we agree on a CQL syntax?
>>
>> I don’t believe there has been a poll on CQL syntax… my understanding
>> reading all the threads is that there are ~4-5 options and none are -1ed, so
>> believe we are waiting for majority rule on this?
>
>
>
> Re-reading that thread, IIUC the valid choices remaining are…
>
> 1. VECTOR FLOAT[n]
> 2. FLOAT VECTOR[n]
> 3. VECTOR
> 4. VECTOR[n]
> 5. ARRAY
> 6. NON-NULL FROZEN
>
>
> Yes I'm putting my preference (1) first ;) because (banging on) if the future 
> of CQL will have FLOAT[n] and FROZEN, where the VECTOR keyword is: 
> for general cql users; just meaning "non-null and frozen", these gel best 
> together.
>
> Options (5) and (6) are for those that feel we can and should provide this 
> type without introducing the vector keyword.
>
>


Re: [POLL] Vector type for ML

2023-05-04 Thread Mike Adamson
That's fair comment. In this case I would be happy with any of your
suggestions although I would prefer that the datatype did not support
nulls.

On Thu, 4 May 2023 at 11:55, Benedict  wrote:

> I would expect that the type of index would be specified anyway?
>
> I don’t think it’s good API design to have the field define the index you
> create - only to shape what is permitted.
>
> A HNSW index is very specific and should be asked for specifically, not
> implicitly, IMO.
>
> On 4 May 2023, at 11:47, Mike Adamson  wrote:
>
> 
>
>> For syntax, I think one option was just FLOAT[N]. In VECTOR FLOAT[N],
>> VECTOR is redundant - FLOAT[N] is fully descriptive by itself. I don’t
>> think VECTOR should be used to simply imply non-null, as this would be very
>> unintuitive. More logical would be NONNULL, if this is the only condition
>> being applied. Alternatively for arrays we could default to NONNULL and
>> later introduce NULLABLE if we want to permit nulls.
>>
>
> I have a small issue relating to not having a specific VECTOR tag on the
> data type. The driver behind adding this datatype is the hnsw index that is
> being added to consume this data. If we have a generic array datatype, what
> is the expectation going to be for users who create an index on it? The
> hnsw index will support only floats initially so we would have to reject
> any non-float arrays if an attempt was made to create an hnsw index on it.
> While there is no problem with doing this, there would be a problem if, in
> the future, we allow indexing in arrays in the same way that we index
> collections. In this case we would then need to have the user select what
> type of index they want at creation time.
>
> Can I add another proposal that we allow a VECTOR or DENSE (this is a well
> known term in the ML space) keyword that could be used when the array is
> going to be used for ML workloads. This would be optional and would
> function similarly to FROZEN in that it would limit the functionality of
> the array to ML usage.
>
> On Thu, 4 May 2023 at 09:45, Benedict  wrote:
>
>> Hurrah for initial agreement.
>>
>> For syntax, I think one option was just FLOAT[N]. In VECTOR FLOAT[N],
>> VECTOR is redundant - FLOAT[N] is fully descriptive by itself. I don’t
>> think VECTOR should be used to simply imply non-null, as this would be very
>> unintuitive. More logical would be NONNULL, if this is the only condition
>> being applied. Alternatively for arrays we could default to NONNULL and
>> later introduce NULLABLE if we want to permit nulls.
>>
>> If the word vector is to be used it makes more sense to make it look like
>> a list, so VECTOR as here the word VECTOR is clearly not
>> redundant.
>>
>> So, I vote:
>>
>> 1) (NON NULL) FLOAT[N]
>> 2) FLOAT[N]   (Non null by default)
>> 3) VECTOR
>>
>>
>>
>> On 4 May 2023, at 08:52, Mick Semb Wever  wrote:
>>
>> 
>>
>>> Did we agree on a CQL syntax?
>>>
>>> I don’t believe there has been a poll on CQL syntax… my understanding
>>> reading all the threads is that there are ~4-5 options and none are -1ed, so
>>> believe we are waiting for majority rule on this?
>>>
>>
>>
>> Re-reading that thread, IIUC the valid choices remaining are…
>>
>> 1. VECTOR FLOAT[n]
>> 2. FLOAT VECTOR[n]
>> 3. VECTOR
>> 4. VECTOR[n]
>> 5. ARRAY
>> 6. NON-NULL FROZEN
>>
>>
>> Yes I'm putting my preference (1) first ;) because (banging on) if the
>> future of CQL will have FLOAT[n] and FROZEN, where the VECTOR
>> keyword is: for general cql users; just meaning "non-null and frozen",
>> these gel best together.
>>
>> Options (5) and (6) are for those that feel we can and should provide
>> this type without introducing the vector keyword.
>>
>>
>>
>>
>
> --
> *Mike Adamson*
> Engineering
>
> +1 650 389 6000 | datastax.com
>
>

-- 
*Mike Adamson*
Engineering

+1 650 389 6000 | datastax.com

Re: [POLL] Vector type for ML

2023-05-04 Thread Benedict
I would expect that the type of index would be specified anyway?

I don’t think it’s good API design to have the field define the index you
create - only to shape what is permitted.

A HNSW index is very specific and should be asked for specifically, not
implicitly, IMO.

On 4 May 2023, at 11:47, Mike Adamson  wrote:

>> For syntax, I think one option was just FLOAT[N]. In VECTOR FLOAT[N],
>> VECTOR is redundant - FLOAT[N] is fully descriptive by itself. I don’t
>> think VECTOR should be used to simply imply non-null, as this would be very
>> unintuitive. More logical would be NONNULL, if this is the only condition
>> being applied. Alternatively for arrays we could default to NONNULL and
>> later introduce NULLABLE if we want to permit nulls.
>
> I have a small issue relating to not having a specific VECTOR tag on the
> data type. The driver behind adding this datatype is the hnsw index that is
> being added to consume this data. If we have a generic array datatype, what
> is the expectation going to be for users who create an index on it? The
> hnsw index will support only floats initially so we would have to reject
> any non-float arrays if an attempt was made to create an hnsw index on it.
> While there is no problem with doing this, there would be a problem if, in
> the future, we allow indexing in arrays in the same way that we index
> collections. In this case we would then need to have the user select what
> type of index they want at creation time.
>
> Can I add another proposal that we allow a VECTOR or DENSE (this is a well
> known term in the ML space) keyword that could be used when the array is
> going to be used for ML workloads. This would be optional and would
> function similarly to FROZEN in that it would limit the functionality of
> the array to ML usage.
>
> On Thu, 4 May 2023 at 09:45, Benedict  wrote:
>
>> Hurrah for initial agreement.
>>
>> For syntax, I think one option was just FLOAT[N]. In VECTOR FLOAT[N],
>> VECTOR is redundant - FLOAT[N] is fully descriptive by itself. I don’t
>> think VECTOR should be used to simply imply non-null, as this would be very
>> unintuitive. More logical would be NONNULL, if this is the only condition
>> being applied. Alternatively for arrays we could default to NONNULL and
>> later introduce NULLABLE if we want to permit nulls.
>>
>> If the word vector is to be used it makes more sense to make it look like
>> a list, so VECTOR as here the word VECTOR is clearly not
>> redundant.
>>
>> So, I vote:
>>
>> 1) (NON NULL) FLOAT[N]
>> 2) FLOAT[N]   (Non null by default)
>> 3) VECTOR
>>
>> On 4 May 2023, at 08:52, Mick Semb Wever  wrote:
>>
>>> Did we agree on a CQL syntax?
>>>
>>> I don’t believe there has been a poll on CQL syntax… my understanding
>>> reading all the threads is that there are ~4-5 options and none are -1ed, so
>>> believe we are waiting for majority rule on this?
>>
>> Re-reading that thread, IIUC the valid choices remaining are…
>>
>> 1. VECTOR FLOAT[n]
>> 2. FLOAT VECTOR[n]
>> 3. VECTOR
>> 4. VECTOR[n]
>> 5. ARRAY
>> 6. NON-NULL FROZEN
>>
>> Yes I'm putting my preference (1) first ;) because (banging on) if the
>> future of CQL will have FLOAT[n] and FROZEN, where the VECTOR
>> keyword is: for general cql users; just meaning "non-null and frozen",
>> these gel best together.
>>
>> Options (5) and (6) are for those that feel we can and should provide
>> this type without introducing the vector keyword.
>
> --
> Mike Adamson
> Engineering
> +1 650 389 6000 | datastax.com


Re: [POLL] Vector type for ML

2023-05-04 Thread Mike Adamson
>
> For syntax, I think one option was just FLOAT[N]. In VECTOR FLOAT[N],
> VECTOR is redundant - FLOAT[N] is fully descriptive by itself. I don’t
> think VECTOR should be used to simply imply non-null, as this would be very
> unintuitive. More logical would be NONNULL, if this is the only condition
> being applied. Alternatively for arrays we could default to NONNULL and
> later introduce NULLABLE if we want to permit nulls.
>

I have a small issue relating to not having a specific VECTOR tag on the
data type. The driver behind adding this datatype is the hnsw index that is
being added to consume this data. If we have a generic array datatype, what
is the expectation going to be for users who create an index on it? The
hnsw index will support only floats initially so we would have to reject
any non-float arrays if an attempt was made to create an hnsw index on it.
While there is no problem with doing this, there would be a problem if, in
the future, we allow indexing in arrays in the same way that we index
collections. In this case we would then need to have the user select what
type of index they want at creation time.

Can I add another proposal that we allow a VECTOR or DENSE (this is a well
known term in the ML space) keyword that could be used when the array is
going to be used for ML workloads. This would be optional and would
function similarly to FROZEN in that it would limit the functionality of
the array to ML usage.
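
To make the index concern concrete, here is a minimal Python sketch of the check described above: an hnsw-style index accepts only float arrays and rejects anything else at index creation time. The names ArrayType and validate_hnsw_index, and the string type names, are hypothetical and exist only for illustration; they are not Cassandra APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArrayType:
    """Hypothetical stand-in for a fixed-length array column type."""
    element_type: str  # e.g. "float", "int", "text"
    dimension: int


def validate_hnsw_index(column_type: ArrayType) -> None:
    """Reject index creation unless the column is a float array.

    The restriction lives in the index check, not in the type itself.
    """
    if column_type.dimension <= 0:
        raise ValueError("array dimension must be positive")
    if column_type.element_type != "float":
        raise ValueError(
            "hnsw index supports only float arrays, got "
            f"{column_type.element_type}[{column_type.dimension}]"
        )


# A float array is accepted silently.
validate_hnsw_index(ArrayType("float", 3))

# A non-float array is rejected when the index is created.
try:
    validate_hnsw_index(ArrayType("int", 3))
except ValueError as err:
    rejected = str(err)
```

With a check like this the generic array type stays unrestricted, and only the index decides which element types it can consume.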

On Thu, 4 May 2023 at 09:45, Benedict  wrote:

> Hurrah for initial agreement.
>
> For syntax, I think one option was just FLOAT[N]. In VECTOR FLOAT[N],
> VECTOR is redundant - FLOAT[N] is fully descriptive by itself. I don’t
> think VECTOR should be used to simply imply non-null, as this would be very
> unintuitive. More logical would be NONNULL, if this is the only condition
> being applied. Alternatively for arrays we could default to NONNULL and
> later introduce NULLABLE if we want to permit nulls.
>
> If the word vector is to be used it makes more sense to make it look like
> a list, so VECTOR as here the word VECTOR is clearly not
> redundant.
>
> So, I vote:
>
> 1) (NON NULL) FLOAT[N]
> 2) FLOAT[N]   (Non null by default)
> 3) VECTOR
>
>
>
> On 4 May 2023, at 08:52, Mick Semb Wever  wrote:
>
> 
>
>> Did we agree on a CQL syntax?
>>
>> I don’t believe there has been a poll on CQL syntax… my understanding
>> reading all the threads is that there are ~4-5 options and none are -1ed, so
>> I believe we are waiting for majority rule on this?
>>
>
>
> Re-reading that thread, IIUC the valid choices remaining are…
>
> 1. VECTOR FLOAT[n]
> 2. FLOAT VECTOR[n]
> 3. VECTOR
> 4. VECTOR[n]
> 5. ARRAY
> 6. NON-NULL FROZEN
>
>
> Yes I'm putting my preference (1) first ;) because (banging on) if the
> future of CQL will have FLOAT[n] and FROZEN, where the VECTOR
> keyword is: for general cql users; just meaning "non-null and frozen",
> these gel best together.
>
> Options (5) and (6) are for those that feel we can and should provide this
> type without introducing the vector keyword.
>
>
>
>

-- 
Mike Adamson
Engineering

+1 650 389 6000 | datastax.com



Re: [POLL] Vector type for ML

2023-05-04 Thread Benedict
Hurrah for initial agreement.

For syntax, I think one option was just FLOAT[N]. In VECTOR FLOAT[N], VECTOR is 
redundant - FLOAT[N] is fully descriptive by itself. I don’t think VECTOR 
should be used to simply imply non-null, as this would be very unintuitive. 
More logical would be NONNULL, if this is the only condition being applied. 
Alternatively for arrays we could default to NONNULL and later introduce 
NULLABLE if we want to permit nulls.

If the word vector is to be used it makes more sense to make it look like a 
list, so VECTOR as here the word VECTOR is clearly not redundant.

So, I vote:

1) (NON NULL) FLOAT[N]
2) FLOAT[N]   (Non null by default)
3) VECTOR
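
As a rough illustration of option 2, FLOAT[N] with non-null as the default and a possible later NULLABLE, here is a minimal Python sketch. The function name validate_float_array and its nullable flag are assumptions made for this example, not part of any proposed CQL or Cassandra API.

```python
from typing import List, Optional, Sequence


def validate_float_array(
    value: Sequence[Optional[float]], n: int, nullable: bool = False
) -> List[Optional[float]]:
    """Validate a value against a hypothetical FLOAT[n] column.

    Non-null is the default; nullable=True models a later NULLABLE opt-in.
    """
    if len(value) != n:
        raise ValueError(f"expected {n} elements, got {len(value)}")
    if not nullable and any(v is None for v in value):
        raise ValueError("null elements are not permitted")
    # Normalise the numeric elements to float and keep permitted nulls.
    return [None if v is None else float(v) for v in value]


print(validate_float_array([1, 2.5, 3], 3))
print(validate_float_array([1.0, None], 2, nullable=True))
```

Under this reading, a keyword is only needed for the less common case (nulls allowed), which is the point of making non-null the default.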



> On 4 May 2023, at 08:52, Mick Semb Wever  wrote:
> 
> 
>>> Did we agree on a CQL syntax?
>> I don’t believe there has been a poll on CQL syntax… my understanding 
>> reading all the threads is that there are ~4-5 options and none are -1ed, so 
>> I believe we are waiting for majority rule on this?
> 
> 
> Re-reading that thread, IIUC the valid choices remaining are…
> 
> 1. VECTOR FLOAT[n]
> 2. FLOAT VECTOR[n]
> 3. VECTOR
> 4. VECTOR[n]
> 5. ARRAY
> 6. NON-NULL FROZEN
> 
> 
> Yes I'm putting my preference (1) first ;) because (banging on) if the future 
> of CQL will have FLOAT[n] and FROZEN, where the VECTOR keyword is: 
> for general cql users; just meaning "non-null and frozen", these gel best 
> together.
> 
> Options (5) and (6) are for those that feel we can and should provide this 
> type without introducing the vector keyword.
> 
>  


Re: [POLL] Vector type for ML

2023-05-04 Thread Mick Semb Wever
>
> Did we agree on a CQL syntax?
>
> I don’t believe there has been a poll on CQL syntax… my understanding
> reading all the threads is that there are ~4-5 options and none are -1ed, so
> I believe we are waiting for majority rule on this?
>


Re-reading that thread, IIUC the valid choices remaining are…

1. VECTOR FLOAT[n]
2. FLOAT VECTOR[n]
3. VECTOR
4. VECTOR[n]
5. ARRAY
6. NON-NULL FROZEN


Yes I'm putting my preference (1) first ;) because (banging on) if the
future of CQL will have FLOAT[n] and FROZEN, where the VECTOR
keyword is: for general cql users; just meaning "non-null and frozen",
these gel best together.

Options (5) and (6) are for those that feel we can and should provide this
type without introducing the vector keyword.


Re: [POLL] Vector type for ML

2023-05-03 Thread David Capwell
> Did we agree on a CQL syntax?

I don’t believe there has been a poll on CQL syntax… my understanding reading 
all the threads is that there are ~4-5 options and none are -1ed, so I believe we 
are waiting for majority rule on this?

> On May 3, 2023, at 1:23 PM, Jeremiah D Jordan  wrote:
> 
>> To be clear, I support the general agreement David and Jonathan seem to have 
>> reached.
> 
> +1 as well.
> 
> 
>> On May 3, 2023, at 3:07 PM, Caleb Rackliffe  wrote:
>> 
>> To be clear, I support the general agreement David and Jonathan seem to have 
>> reached.
>> 
>> On Wed, May 3, 2023 at 3:05 PM Caleb Rackliffe wrote:
>>> Did we agree on a CQL syntax?
>>> 
>>> On Wed, May 3, 2023 at 2:06 PM Rahul Xavier Singh <
>>> rahul.xavier.si...@gmail.com> wrote:
>>>> I like this approach. Thank you for those working on this vector search 
>>>> initiative. 
>>>> 
>>>> Here's the feedback from my "user" hat for someone who is looking at 
>>>> databases / indexes for my next LLM app. 
>>>> 
>>>> Can I take some python code and go from using an in memory vector store 
>>>> like numpy or FAISS to something else? How easy is it for me to take my 
>>>> python code and get it to work with this new external service which is no 
>>>> longer just a library?
>>>> There's also tons of services that I can run on docker e.g. milvus, 
>>>> redissearch, typesense, elasticsearch, opensearch and I may hit a hurdle 
>>>> when trying to do a lot more data, so I look at Cassandra Vector Search. 
>>>> Because I am familiar with SQL, Cassandra looks appealing since I can 
>>>> potentially use "cql_agent" lib (to be created for langchain and we're 
>>>> looking into that now) or an existing CassandraVectorStore class?
>>>> 
>>>> In most of these scenarios, if people are using langchain, llamaindex, the 
>>>> underlying implementation is not as important since we shield the user 
>>>> from CQL data types except at schema creation and most of these libs can be 
>>>> opinionated and just suggest a generic schema. 
>>>> 
>>>> The ideal world is if I can just dump text into a field and do a natural 
>>>> language query on it and have my DB do the embeddings for the document, 
>>>> and then for the query for me. For now the libs can manage all that and 
>>>> they do that well. We just need the interface to stay consistent and be 
>>>> relatively easy to query in CQL. The most popular index in LLM retrieval 
>>>> augmented patterns is pinecone. You make an index, you upsert, and then 
>>>> you query. It's not assumed that you are also giving it content, though 
>>>> you can send it metadata to have the document there. 
>>>> 
>>>> If we can have a similar workflow e.g. create a table with a vector type 
>>>> OR create a table with an existing type and then add an index to it, no 
>>>> one is going to lose sleep over it as long as it works. Having the ability 
>>>> to take a table that has data, and then add a vector index doesn't make it 
>>>> any different than adding a new field since I've got to calculate the 
>>>> embeddings anyways. 
>>>> 
>>>> Would love to see how the CQL ends up looking like. 
>>>> Rahul Singh
>>>> Chief Executive Officer | Business Platform Architect
>>>> m: 202.905.2818 e: rahul.si...@anant.us li: http://linkedin.com/in/xingh
>>>> ca: http://calendly.com/xingh
>>>> 
>>>> We create, support, and manage real-time global data & analytics platforms 
>>>> for the modern enterprise.
>>>> 
>>>> Anant | https://anant.us
>>>> 
>>>> 3 Washington Circle, Suite 301
>>>> Washington, D.C. 20037
>>>> 
>>>> http://Cassandra.Link : The best resources for Apache Cassandra
>>>> 
>>>> 
>>>> On Tue, May 2, 2023 at 6:39 PM Patrick McFadin wrote:
>>>>> \o/
>>>>> 
>>>>> Bring it in team. Group hug. 
>>>>> 
>>>>> Now if you'll excuse me, I'm going to go build my preso on how Cassandra 
>>>>> is the only distributed database you can do vector search in an ACID 
>>>>> transaction. 
>>>>> 
>>>>> Patrick
>>>>> 
>>>>> On Tue, May 2, 2023 at 3:27 PM Jonathan Ellis wrote:
>>>>>> I had a call with David.  We agreed that we want a "vector" data type 
>>>>>> with these properties
>>>>>> 
>>>>>> - Fixed length
>>>>>> - No nulls
>>>>>> - Random access not supported
>>>>>> 

Re: [POLL] Vector type for ML

2023-05-03 Thread Jeremiah D Jordan
> To be clear, I support the general agreement David and Jonathan seem to have 
> reached.

+1 as well.


> On May 3, 2023, at 3:07 PM, Caleb Rackliffe  wrote:
> 
> To be clear, I support the general agreement David and Jonathan seem to have 
> reached.
> 
> On Wed, May 3, 2023 at 3:05 PM Caleb Rackliffe wrote:
>> Did we agree on a CQL syntax?
>> 
>> On Wed, May 3, 2023 at 2:06 PM Rahul Xavier Singh <
>> rahul.xavier.si...@gmail.com> wrote:
>>> I like this approach. Thank you for those working on this vector search 
>>> initiative. 
>>> 
>>> Here's the feedback from my "user" hat for someone who is looking at 
>>> databases / indexes for my next LLM app. 
>>> 
>>> Can I take some python code and go from using an in memory vector store 
>>> like numpy or FAISS to something else? How easy is it for me to take my 
>>> python code and get it to work with this new external service which is no 
>>> longer just a library?
>>> There's also tons of services that I can run on docker e.g. milvus, 
>>> redissearch, typesense, elasticsearch, opensearch and I may hit a hurdle 
>>> when trying to do a lot more data, so I look at Cassandra Vector Search. 
>>> Because I am familiar with SQL , Cassandra looks appealing since I can 
>>> potentially use "cql_agent" lib ( to be created for langchain and we're 
>>> looking into that now) or an existing CassandraVectorStore class?
>>> 
>>> In most of these scenarios, if people are using langchain, llamaindex, the 
>>> underlying implementation is not as important since we shield the user from 
>>> CQL data types except at schema creation and most of these libs can be 
>>> opinionated and just suggest a generic schema. 
>>> 
>>> The ideal world is if I can just dump text into a field and do a natural 
>>> language query on it and have my DB do the embeddings for the document, and 
>>> then for the query for me. For now the libs can manage all that and they do 
>>> that well. We just need the interface to stay consistent and be relatively 
>>> easy to query in CQL. The most popular index in LLM retrieval augmented 
>>> patterns is pinecone. You make an index, you upsert, and then you query. 
>>> It's not assumed that you are also giving it content, though you can send 
>>> it metadata to have the document there. 
>>> 
>>> If we can have a similar workflow e.g. create a table with a vector type OR 
>>> create a table with an existing type and then add an index to it, no one is 
>>> going to lose sleep over it as long as it works. Having the ability to take a 
>>> table that has data, and then add a vector index doesn't make it any 
>>> different than adding a new field since I've got to calculate the 
>>> embeddings anyways. 
>>> 
>>> Would love to see how the CQL ends up looking like. 
>>> Rahul Singh
>>> Chief Executive Officer | Business Platform Architect
>>> m: 202.905.2818 e: rahul.si...@anant.us  li: 
>>> http://linkedin.com/in/xingh 
>>> 
>>>  ca: http://calendly.com/xingh 
>>> 
>>> We create, support, and manage real-time global data & analytics platforms 
>>> for the modern enterprise.
>>> 
>>> Anant | https://anant.us 
>>> 
>>> 3 Washington Circle, Suite 301
>>> Washington, D.C. 20037
>>> 
>>> http://Cassandra.Link 
>>> 
>>>  : The best resources for Apache Cassandra
>>> 
>>> 
>>> On Tue, May 2, 2023 at 6:39 PM Patrick McFadin wrote:
>>>> \o/
>>>> 
>>>> Bring it in team. Group hug. 
>>>> 
>>>> Now if you'll excuse me, I'm going to go build my preso on how Cassandra 
>>>> is the only distributed database you can do vector search in an ACID 
>>>> transaction. 
>>>> 
>>>> Patrick
>>>> 
>>>> On Tue, May 2, 2023 at 3:27 PM Jonathan Ellis wrote:
>>>>> I had a call with David.  We agreed that we want a "vector" data type 
>>>>> with these properties
>>>>> 
>>>>> - Fixed length
>>>>> - No nulls
>>>>> - Random access not supported
>>>>> 
>>>>> Where we disagreed was on my proposal to restrict vectors to only numeric 
>>>>> data.  David's points were that
>>>>> 
>>>>> (1) He has a use case today for a data type with the other vector 
>>>>> properties,
>>>>> (2) It doesn't seem reasonable to create two data types with the same 
>>>>> properties, one of which is restricted to numerics, and
>>>>> (3) The restrictions that I 

Re: [POLL] Vector type for ML

2023-05-03 Thread Caleb Rackliffe
To be clear, I support the general agreement David and Jonathan seem to
have reached.

On Wed, May 3, 2023 at 3:05 PM Caleb Rackliffe 
wrote:

> Did we agree on a CQL syntax?
>
> On Wed, May 3, 2023 at 2:06 PM Rahul Xavier Singh <
> rahul.xavier.si...@gmail.com> wrote:
>
>> I like this approach. Thank you for those working on this vector search
>> initiative.
>>
>> Here's the feedback from my "user" hat for someone who is looking at
>> databases / indexes for my next LLM app.
>>
>> Can I take some python code and go from using an in memory vector store
>> like numpy or FAISS to something else? How easy is it for me to take my
>> python code and get it to work with this new external service which is no
>> longer just a library?
>> There's also tons of services that I can run on docker e.g. milvus,
>> redissearch, typesense, elasticsearch, opensearch and I may hit a hurdle
>> when trying to do a lot more data, so I look at Cassandra Vector Search.
>> Because I am familiar with SQL , Cassandra looks appealing since I can
>> potentially use "cql_agent" lib ( to be created for langchain and we're
>> looking into that now) or an existing CassandraVectorStore class?
>>
>> In most of these scenarios, if people are using langchain, llamaindex,
>> the underlying implementation is not as important since we shield the user
>> from CQL data types except at schema creation and most of these libs can be
>> opinionated and just suggest a generic schema.
>>
>> The ideal world is if I can just dump text into a field and do a natural
>> language query on it and have my DB do the embeddings for the document, and
>> then for the query for me. For now the libs can manage all that and they do
>> that well. We just need the interface to stay consistent and be relatively
>> easy to query in CQL. The most popular index in LLM retrieval augmented
>> patterns is pinecone. You make an index, you upsert, and then you query.
>> It's not assumed that you are also giving it content, though you can send
>> it metadata to have the document there.
>>
>> If we can have a similar workflow e.g. create a table with a vector type
>> OR create a table with an existing type and then add an index to it, no one
>> is going to lose sleep over it as long as it works. Having the ability to take a
>> table that has data, and then add a vector index doesn't make it any
>> different than adding a new field since I've got to calculate the
>> embeddings anyways.
>>
>> Would love to see how the CQL ends up looking like.
>> Rahul Singh
>>
>> Chief Executive Officer | Business Platform Architect m: 202.905.2818 e:
>> rahul.si...@anant.us li: http://linkedin.com/in/xingh ca:
>> http://calendly.com/xingh
>>
>> *We create, support, and manage real-time global data & analytics
>> platforms for the modern enterprise.*
>>
>> *Anant | https://anant.us *
>>
>> 3 Washington Circle, Suite 301
>>
>> Washington, D.C. 20037
>>
>> *http://Cassandra.Link * : The best resources
>> for Apache Cassandra
>>
>>
>> On Tue, May 2, 2023 at 6:39 PM Patrick McFadin 
>> wrote:
>>
>>> \o/
>>>
>>> Bring it in team. Group hug.
>>>
>>> Now if you'll excuse me, I'm going to go build my preso on how Cassandra
>>> is the only distributed database you can do vector search in an ACID
>>> transaction.
>>>
>>> Patrick
>>>
>>> On Tue, May 2, 2023 at 3:27 PM Jonathan Ellis  wrote:
>>>
>>>> I had a call with David.  We agreed that we want a "vector" data type
>>>> with these properties
>>>>
>>>> - Fixed length
>>>> - No nulls
>>>> - Random access not supported
>>>>
>>>> Where we disagreed was on my proposal to restrict vectors to only
>>>> numeric data.  David's points were that
>>>>
>>>> (1) He has a use case today for a data type with the other vector
>>>> properties,
>>>> (2) It doesn't seem reasonable to create two data types with the same
>>>> properties, one of which is restricted to numerics, and
>>>> (3) The restrictions that I want for numeric vectors make more sense at
>>>> the index and function level, than at the type level.
>>>>
>>>> I'm ready to concede that David has the better case here and move
>>>> forward with a vector implementation without that restriction.
>>>>
>>>> On Tue, May 2, 2023 at 4:03 PM David Capwell wrote:
>>>>
>>>>> How about it, David? Did you already make this?
>>>>>
>>>>>
>>>>> I checked out the patch, fixed serialize/deserialize, added the
>>>>> constraints, then added a composeForFloat(ByteBuffer), with this the
>>>>> impact to the POC patch was the following
>>>>>
>>>>> 1) move away from VectorType.instance.serializer().deserialize(bb) to
>>>>> type.composeForFloat(bb), both return float[]
>>>>> 2) change the index validate logic to move away from checking
>>>>> VectorType and instead check for that plus the element type == FloatType.
>>>>> I didn’t bother to do this as it's trivial
>>>>>
>>>>> David. End this argument. SHOW THE CODE!
>>>>>
>>>>>
>>>>> If this argument ends and people are cool with vector 

Re: [POLL] Vector type for ML

2023-05-03 Thread Caleb Rackliffe
Did we agree on a CQL syntax?

On Wed, May 3, 2023 at 2:06 PM Rahul Xavier Singh <
rahul.xavier.si...@gmail.com> wrote:

> I like this approach. Thank you for those working on this vector search
> initiative.
>
> Here's the feedback from my "user" hat for someone who is looking at
> databases / indexes for my next LLM app.
>
> Can I take some python code and go from using an in memory vector store
> like numpy or FAISS to something else? How easy is it for me to take my
> python code and get it to work with this new external service which is no
> longer just a library?
> There's also tons of services that I can run on docker e.g. milvus,
> redissearch, typesense, elasticsearch, opensearch and I may hit a hurdle
> when trying to do a lot more data, so I look at Cassandra Vector Search.
> Because I am familiar with SQL , Cassandra looks appealing since I can
> potentially use "cql_agent" lib ( to be created for langchain and we're
> looking into that now) or an existing CassandraVectorStore class?
>
> In most of these scenarios, if people are using langchain, llamaindex, the
> underlying implementation is not as important since we shield the user from
> CQL data types except at schema creation and most of these libs can be
> opinionated and just suggest a generic schema.
>
> The ideal world is if I can just dump text into a field and do a natural
> language query on it and have my DB do the embeddings for the document, and
> then for the query for me. For now the libs can manage all that and they do
> that well. We just need the interface to stay consistent and be relatively
> easy to query in CQL. The most popular index in LLM retrieval augmented
> patterns is pinecone. You make an index, you upsert, and then you query.
> It's not assumed that you are also giving it content, though you can send
> it metadata to have the document there.
>
> If we can have a similar workflow e.g. create a table with a vector type
> OR create a table with an existing type and then add an index to it, no one
> is going to lose sleep over it as long as it works. Having the ability to take a
> table that has data, and then add a vector index doesn't make it any
> different than adding a new field since I've got to calculate the
> embeddings anyways.
>
> Would love to see how the CQL ends up looking like.
> Rahul Singh
>
> Chief Executive Officer | Business Platform Architect m: 202.905.2818 e:
> rahul.si...@anant.us li: http://linkedin.com/in/xingh ca:
> http://calendly.com/xingh
>
> *We create, support, and manage real-time global data & analytics
> platforms for the modern enterprise.*
>
> *Anant | https://anant.us *
>
> 3 Washington Circle, Suite 301
>
> Washington, D.C. 20037
>
> *http://Cassandra.Link * : The best resources for
> Apache Cassandra
>
>
> On Tue, May 2, 2023 at 6:39 PM Patrick McFadin  wrote:
>
>> \o/
>>
>> Bring it in team. Group hug.
>>
>> Now if you'll excuse me, I'm going to go build my preso on how Cassandra
>> is the only distributed database you can do vector search in an ACID
>> transaction.
>>
>> Patrick
>>
>> On Tue, May 2, 2023 at 3:27 PM Jonathan Ellis  wrote:
>>
>>> I had a call with David.  We agreed that we want a "vector" data type
>>> with these properties
>>>
>>> - Fixed length
>>> - No nulls
>>> - Random access not supported
>>>
>>> Where we disagreed was on my proposal to restrict vectors to only
>>> numeric data.  David's points were that
>>>
>>> (1) He has a use case today for a data type with the other vector
>>> properties,
>>> (2) It doesn't seem reasonable to create two data types with the same
>>> properties, one of which is restricted to numerics, and
>>> (3) The restrictions that I want for numeric vectors make more sense at
>>> the index and function level, than at the type level.
>>>
>>> I'm ready to concede that David has the better case here and move
>>> forward with a vector implementation without that restriction.
>>>
>>> On Tue, May 2, 2023 at 4:03 PM David Capwell  wrote:
>>>
>>>> How about it, David? Did you already make this?
>>>>
>>>>
>>>> I checked out the patch, fixed serialize/deserialize, added the
>>>> constraints, then added a composeForFloat(ByteBuffer), with this the impact
>>>> to the POC patch was the following
>>>>
>>>> 1) move away from VectorType.instance.serializer().deserialize(bb) to
>>>> type.composeForFloat(bb), both return float[]
>>>> 2) change the index validate logic to move away from checking
>>>> VectorType and instead check for that plus the element type == FloatType.
>>>> I didn’t bother to do this as it's trivial
>>>>
>>>> David. End this argument. SHOW THE CODE!
>>>>
>>>>
>>>> If this argument ends and people are cool with vector supporting
>>>> abstract type, more than glad to help get this in.
>>>>
>>>> On May 2, 2023, at 1:53 PM, Jeremy Hanna wrote:
>>>>
>>>> I'm all for bringing more functionality to the masses sooner, but the
>>>> original idea has a very very specific use case.  Do we have use 

Re: [POLL] Vector type for ML

2023-05-03 Thread Rahul Xavier Singh
I like this approach. Thank you to those working on this vector search
initiative.

Here's the feedback from my "user" hat for someone who is looking at
databases / indexes for my next LLM app.

Can I take some Python code and go from using an in-memory vector store
like numpy or FAISS to something else? How easy is it for me to take my
Python code and get it to work with this new external service which is no
longer just a library?
There are also tons of services that I can run on Docker, e.g. milvus,
redissearch, typesense, elasticsearch, opensearch, and I may hit a hurdle
when trying to handle a lot more data, so I look at Cassandra Vector Search.
Because I am familiar with SQL, Cassandra looks appealing since I can
potentially use a "cql_agent" lib (to be created for langchain and we're
looking into that now) or an existing CassandraVectorStore class?

In most of these scenarios, if people are using langchain or llamaindex, the
underlying implementation is not as important since we shield the user from
CQL data types except at schema creation, and most of these libs can be
opinionated and just suggest a generic schema.

The ideal world is if I can just dump text into a field and do a natural
language query on it and have my DB do the embeddings for the document, and
then for the query for me. For now the libs can manage all that and they do
that well. We just need the interface to stay consistent and be relatively
easy to query in CQL. The most popular index in LLM retrieval augmented
patterns is pinecone. You make an index, you upsert, and then you query.
It's not assumed that you are also giving it content, though you can send
it metadata to have the document there.

If we can have a similar workflow, e.g. create a table with a vector type OR
create a table with an existing type and then add an index to it, no one is
going to lose sleep over it as long as it works. Having the ability to take a
table that has data, and then add a vector index doesn't make it any
different than adding a new field since I've got to calculate the
embeddings anyways.
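
For reference, the "make an index, upsert, query" workflow mentioned above can be sketched in plain Python as follows. This is only an in-memory illustration under stated assumptions: the class InMemoryVectorIndex and its methods are hypothetical, similarity is cosine, and nothing here reflects the eventual CQL syntax or any existing library API.

```python
import math


def _cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class InMemoryVectorIndex:
    """Hypothetical fixed-dimension index with upsert and top-k query."""

    def __init__(self, dimension):
        self.dimension = dimension
        self._rows = {}  # key -> (embedding, metadata)

    def __len__(self):
        return len(self._rows)

    def upsert(self, key, embedding, metadata=None):
        # Fixed length is enforced on write; an existing key is overwritten.
        if len(embedding) != self.dimension:
            raise ValueError(
                f"expected {self.dimension} elements, got {len(embedding)}"
            )
        self._rows[key] = ([float(x) for x in embedding], metadata or {})

    def query(self, embedding, top_k=1):
        # Brute-force scan; a real index such as HNSW would avoid this.
        if len(embedding) != self.dimension:
            raise ValueError(
                f"expected {self.dimension} elements, got {len(embedding)}"
            )
        scored = [
            (_cosine(embedding, vec), key, meta)
            for key, (vec, meta) in self._rows.items()
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        return [(key, meta) for _, key, meta in scored[:top_k]]


index = InMemoryVectorIndex(dimension=2)
index.upsert("doc-a", [1.0, 0.0], {"text": "first document"})
index.upsert("doc-b", [0.0, 1.0])
nearest = index.query([0.9, 0.1], top_k=1)
```

The embeddings are computed by the caller in this sketch, which matches the point that the libraries handle embedding today and the store only needs a consistent interface.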

Would love to see what the CQL ends up looking like.
Rahul Singh

Chief Executive Officer | Business Platform Architect m: 202.905.2818 e:
rahul.si...@anant.us li: http://linkedin.com/in/xingh ca:
http://calendly.com/xingh

*We create, support, and manage real-time global data & analytics platforms
for the modern enterprise.*

*Anant | https://anant.us *

3 Washington Circle, Suite 301

Washington, D.C. 20037

*http://Cassandra.Link * : The best resources for
Apache Cassandra


On Tue, May 2, 2023 at 6:39 PM Patrick McFadin  wrote:

> \o/
>
> Bring it in team. Group hug.
>
> Now if you'll excuse me, I'm going to go build my preso on how Cassandra
> is the only distributed database you can do vector search in an ACID
> transaction.
>
> Patrick
>
> On Tue, May 2, 2023 at 3:27 PM Jonathan Ellis  wrote:
>
>> I had a call with David.  We agreed that we want a "vector" data type
>> with these properties
>>
>> - Fixed length
>> - No nulls
>> - Random access not supported
>>
>> Where we disagreed was on my proposal to restrict vectors to only numeric
>> data.  David's points were that
>>
>> (1) He has a use case today for a data type with the other vector
>> properties,
>> (2) It doesn't seem reasonable to create two data types with the same
>> properties, one of which is restricted to numerics, and
>> (3) The restrictions that I want for numeric vectors make more sense at
>> the index and function level, than at the type level.
>>
>> I'm ready to concede that David has the better case here and move forward
>> with a vector implementation without that restriction.
>>
>> On Tue, May 2, 2023 at 4:03 PM David Capwell  wrote:
>>
>>>  How about it, David? Did you already make this?
>>>
>>>
>>> I checked out the patch, fixed serialize/deserialize, added the
>>> constraints, then added a composeForFloat(ByteBuffer), with this the impact
>>> to the POC patch was the following
>>>
>>> 1) move away from VectorType.instance.serializer().deserialize(bb) to
>>> type.composeForFloat(bb), both return float[]
>>> 2) change the index validate logic to move away from checking VectorType
>>> and instead check for that plus the element type == FloatType.  I didn’t
>>> bother to do this as it's trivial
>>>
>>> David. End this argument. SHOW THE CODE!
>>>
>>>
>>> If this argument ends and people are cool with vector supporting
>>> abstract type, more than glad to help get this in.
>>>
>>> On May 2, 2023, at 1:53 PM, Jeremy Hanna 
>>> wrote:
>>>
>>> I'm all for bringing more functionality to the masses sooner, but the
>>> original idea has a very very specific use case.  Do we have use cases for
>>> a general purpose Vector/Array data structure?  If so, awesome.  I just
>>> wondered if generalizing provides value, beyond being straightforward to
>>> implement.  I'm just trying to be sensitive to the database code
>>> maintenance and driver support for general types 

Re: [POLL] Vector type for ML

2023-05-02 Thread Patrick McFadin
\o/

Bring it in team. Group hug.

Now if you'll excuse me, I'm going to go build my preso on how Cassandra is
the only distributed database you can do vector search in an ACID
transaction.

Patrick

On Tue, May 2, 2023 at 3:27 PM Jonathan Ellis  wrote:

> I had a call with David.  We agreed that we want a "vector" data type with
> these properties
>
> - Fixed length
> - No nulls
> - Random access not supported
>
> Where we disagreed was on my proposal to restrict vectors to only numeric
> data.  David's points were that
>
> (1) He has a use case today for a data type with the other vector
> properties,
> (2) It doesn't seem reasonable to create two data types with the same
> properties, one of which is restricted to numerics, and
> (3) The restrictions that I want for numeric vectors make more sense at
> the index and function level, than at the type level.
>
> I'm ready to concede that David has the better case here and move forward
> with a vector implementation without that restriction.
>
> On Tue, May 2, 2023 at 4:03 PM David Capwell  wrote:
>
>>  How about it, David? Did you already make this?
>>
>>
>> I checked out the patch, fixed serialize/deserialize, added the
>> constraints, then added a composeForFloat(ByteBuffer), with this the impact
>> to the POC patch was the following
>>
>> 1) move away from VectorType.instance.serializer().deserialize(bb) to
>> type.composeForFloat(bb), both return float[]
>> 2) change the index validate logic to move away from checking VectorType
>> and instead check for that plus the element type == FloatType.  I didn’t
>> bother to do this as it's trivial
>>
>> David. End this argument. SHOW THE CODE!
>>
>>
>> If this argument ends and people are cool with vector supporting abstract
>> type, more than glad to help get this in.
>>
>> On May 2, 2023, at 1:53 PM, Jeremy Hanna 
>> wrote:
>>
>> I'm all for bringing more functionality to the masses sooner, but the
>> original idea has a very very specific use case.  Do we have use cases for
>> a general purpose Vector/Array data structure?  If so, awesome.  I just
>> wondered if generalizing provides value, beyond being straightforward to
>> implement.  I'm just trying to be sensitive to the database code
>> maintenance and driver support for general types versus a single type for a
>> specific, well defined purpose.
>>
>> If it could easily be a plugin, that's great - but the full picture
>> involves drivers that need to support it or you end up getting binary blobs
>> you have to decode client side and then do stuff with.  So ideally if you
>> have a well defined use case that you can build into the database, having
>> it just be part of the database and associated drivers - that makes the
>> experience much much better.
>>
>> I'm not trying to say B couldn't be valuable or that a plugin couldn't be
>> feasible.  I'm just trying to enlarge the picture a bit to see what that
>> means for this use case and for the supporting drivers/clients.
>>
>> On May 2, 2023, at 3:04 PM, Benedict  wrote:
>>
>> But it’s so trivial it was already implemented by David in the span of
>> ten minutes? If anything, we’re slowing progress down by refusing to do the
>> extra types, as we’re busy arguing about it rather than delivering a
>> feature?
>>
>> FWIW, my interpretation of the votes today is that we SHOULD NOT (ever)
>> support types beyond float. Not that we should start with float.
>>
>> So, this whole debate is a mess, I think. But hey ho.
>>
>> On 2 May 2023, at 20:57, Patrick McFadin  wrote:
>>
>> 
>> I'll speak up on that one. If you look at my ranked voting, that is where
>> my head is. I get accused of scope creep (a lot) and looking at the initial
>> proposal Jonathan put on the ML it was mostly "Developers are adopting
>> vector search at a furious pace and I think I have a simple way of adding
>> support to keep Cassandra relevant for these use cases" Instead of just
>> focusing on this use case, I feel the arguments have bike shedded into
>> scope creep which means it will take forever to get into the project.
>>
>> My preference is to see one thing validated with an MVP and get it into
>> the hands of developers sooner so we can continue to iterate based on
>> actual usage.
>>
>> It doesn't say your points are wrong or your opinions are broken, I'm
>> voting for what I think will be awesome for users sooner.
>>
>> Patrick
>>
>> On Tue, May 2, 2023 at 12:29 PM Benedict  wrote:
>>
>>> Could folk voting against a general purpose type (that could well be
>>> called a vector) briefly explain their reasoning?
>>>
>>> We established in the other thread that it’s technically trivial,
>>> meaning folk must think it is strictly superior to only support float
>>> rather than eg all numeric types (note: for the type, not the ANN).
>>>
>>> I am surprised, and the blurbs accompanying votes so far don’t seem to
>>> touch on this, mostly just endorsing the idea of a vector.
>>>
>>>
>>> On 2 May 2023, at 20:20, Patrick McFadin  wrote:
>>>

Re: [POLL] Vector type for ML

2023-05-02 Thread Dinesh Joshi
I'm also in favor of having a general data type that is not tied to numeric 
data types alone.

On 2023/05/02 22:27:24 Jonathan Ellis wrote:
> I had a call with David.  We agreed that we want a "vector" data type with
> these properties
> 
> - Fixed length
> - No nulls
> - Random access not supported
> 
> Where we disagreed was on my proposal to restrict vectors to only numeric
> data.  David's points were that
> 
> (1) He has a use case today for a data type with the other vector
> properties,
> (2) It doesn't seem reasonable to create two data types with the same
> properties, one of which is restricted to numerics, and
> (3) The restrictions that I want for numeric vectors make more sense at the
> index and function level, than at the type level.
> 
> I'm ready to concede that David has the better case here and move forward
> with a vector implementation without that restriction.
> 
> On Tue, May 2, 2023 at 4:03 PM David Capwell  wrote:
> 
> >  How about it, David? Did you already make this?
> >
> >
> > I checked out the patch, fixed serialize/deserialize, added the
> > constraints, then added a composeForFloat(ByteBuffer), with this the impact
> > to the POC patch was the following
> >
> > 1) move away from VectorType.instance.serializer().deserialize(bb) to
> > type.composeForFloat(bb), both return float[]
> > 2) change the index validate logic to move away from checking VectorType
> > and instead check for that plus the element type == FloatType.  I didn’t
> > bother to do this as it's trivial
> >
> > David. End this argument. SHOW THE CODE!
> >
> >
> > If this argument ends and people are cool with vector supporting abstract
> > type, more than glad to help get this in.
> >
> > On May 2, 2023, at 1:53 PM, Jeremy Hanna 
> > wrote:
> >
> > I'm all for bringing more functionality to the masses sooner, but the
> > original idea has a very very specific use case.  Do we have use cases for
> > a general purpose Vector/Array data structure?  If so, awesome.  I just
> > wondered if generalizing provides value, beyond being straightforward to
> > implement.  I'm just trying to be sensitive to the database code
> > maintenance and driver support for general types versus a single type for a
> > specific, well defined purpose.
> >
> > If it could easily be a plugin, that's great - but the full picture
> > involves drivers that need to support it or you end up getting binary blobs
> > you have to decode client side and then do stuff with.  So ideally if you
> > have a well defined use case that you can build into the database, having
> > it just be part of the database and associated drivers - that makes the
> > experience much much better.
> >
> > I'm not trying to say B couldn't be valuable or that a plugin couldn't be
> > feasible.  I'm just trying to enlarge the picture a bit to see what that
> > means for this use case and for the supporting drivers/clients.
> >
> > On May 2, 2023, at 3:04 PM, Benedict  wrote:
> >
> > But it’s so trivial it was already implemented by David in the span of ten
> > minutes? If anything, we’re slowing progress down by refusing to do the
> > extra types, as we’re busy arguing about it rather than delivering a
> > feature?
> >
> > FWIW, my interpretation of the votes today is that we SHOULD NOT (ever)
> > support types beyond float. Not that we should start with float.
> >
> > So, this whole debate is a mess, I think. But hey ho.
> >
> > On 2 May 2023, at 20:57, Patrick McFadin  wrote:
> >
> > 
> > I'll speak up on that one. If you look at my ranked voting, that is where
> > my head is. I get accused of scope creep (a lot) and looking at the initial
> > proposal Jonathan put on the ML it was mostly "Developers are adopting
> > vector search at a furious pace and I think I have a simple way of adding
> > support to keep Cassandra relevant for these use cases" Instead of just
> > focusing on this use case, I feel the arguments have bike shedded into
> > scope creep which means it will take forever to get into the project.
> >
> > My preference is to see one thing validated with an MVP and get it into
> > the hands of developers sooner so we can continue to iterate based on
> > actual usage.
> >
> > It doesn't say your points are wrong or your opinions are broken, I'm
> > voting for what I think will be awesome for users sooner.
> >
> > Patrick
> >
> > On Tue, May 2, 2023 at 12:29 PM Benedict  wrote:
> >
> >> Could folk voting against a general purpose type (that could well be
> >> called a vector) briefly explain their reasoning?
> >>
> >> We established in the other thread that it’s technically trivial, meaning
> >> folk must think it is strictly superior to only support float rather than
> >> eg all numeric types (note: for the type, not the ANN).
> >>
> >> I am surprised, and the blurbs accompanying votes so far don’t seem to
> >> touch on this, mostly just endorsing the idea of a vector.
> >>
> >>
> >> On 2 May 2023, at 20:20, Patrick McFadin  wrote:
> >>
> >> 
> >> A > B > C 

Re: [POLL] Vector type for ML

2023-05-02 Thread Jonathan Ellis
I had a call with David.  We agreed that we want a "vector" data type with
these properties

- Fixed length
- No nulls
- Random access not supported

Where we disagreed was on my proposal to restrict vectors to only numeric
data.  David's points were that

(1) He has a use case today for a data type with the other vector
properties,
(2) It doesn't seem reasonable to create two data types with the same
properties, one of which is restricted to numerics, and
(3) The restrictions that I want for numeric vectors make more sense at the
index and function level, than at the type level.

I'm ready to concede that David has the better case here and move forward
with a vector implementation without that restriction.
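
To make point (3) concrete, here is a minimal sketch of how the split could look: the type enforces only "fixed length, no nulls" for any element type, and the float-only restriction lives in an index-level check. The class name VectorTypeSketch and the method supportsAnnIndex are hypothetical and are not taken from the actual patch.

```java
import java.util.List;
import java.util.Objects;

// Hypothetical stand-in for a general-purpose vector type; not Cassandra code.
final class VectorTypeSketch<T> {
    final Class<T> elementType;
    final int dimension;

    VectorTypeSketch(Class<T> elementType, int dimension) {
        if (dimension <= 0)
            throw new IllegalArgumentException("dimension must be positive");
        this.elementType = Objects.requireNonNull(elementType);
        this.dimension = dimension;
    }

    // Type-level rules that apply to every element type: fixed length, no nulls.
    void validate(List<T> values) {
        if (values.size() != dimension)
            throw new IllegalArgumentException(
                "expected " + dimension + " elements, got " + values.size());
        for (T v : values)
            if (v == null)
                throw new IllegalArgumentException("null elements are not allowed");
    }

    // Index-level rule (assumed): an ANN index would accept only float vectors.
    static boolean supportsAnnIndex(VectorTypeSketch<?> type) {
        return type.elementType == Float.class;
    }
}
```

With this shape, a vector of strings is still a valid column type, but an attempt to build an ANN index on it would be rejected by the index check rather than by the type.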

On Tue, May 2, 2023 at 4:03 PM David Capwell  wrote:

>  How about it, David? Did you already make this?
>
>
> I checked out the patch, fixed serialize/deserialize, added the
> constraints, then added a composeForFloat(ByteBuffer), with this the impact
> to the POC patch was the following
>
> 1) move away from VectorType.instance.serializer().deserialize(bb) to
> type.composeForFloat(bb), both return float[]
> 2) change the index validate logic to move away from checking VectorType
> and instead check for that plus the element type == FloatType.  I didn’t
> bother to do this as it's trivial
>
> David. End this argument. SHOW THE CODE!
>
>
> If this argument ends and people are cool with vector supporting abstract
> type, more than glad to help get this in.
>
> On May 2, 2023, at 1:53 PM, Jeremy Hanna 
> wrote:
>
> I'm all for bringing more functionality to the masses sooner, but the
> original idea has a very very specific use case.  Do we have use cases for
> a general purpose Vector/Array data structure?  If so, awesome.  I just
> wondered if generalizing provides value, beyond being straightforward to
> implement.  I'm just trying to be sensitive to the database code
> maintenance and driver support for general types versus a single type for a
> specific, well defined purpose.
>
> If it could easily be a plugin, that's great - but the full picture
> involves drivers that need to support it or you end up getting binary blobs
> you have to decode client side and then do stuff with.  So ideally if you
> have a well defined use case that you can build into the database, having
> it just be part of the database and associated drivers - that makes the
> experience much much better.
>
> I'm not trying to say B couldn't be valuable or that a plugin couldn't be
> feasible.  I'm just trying to enlarge the picture a bit to see what that
> means for this use case and for the supporting drivers/clients.
>
> On May 2, 2023, at 3:04 PM, Benedict  wrote:
>
> But it’s so trivial it was already implemented by David in the span of ten
> minutes? If anything, we’re slowing progress down by refusing to do the
> extra types, as we’re busy arguing about it rather than delivering a
> feature?
>
> FWIW, my interpretation of the votes today is that we SHOULD NOT (ever)
> support types beyond float. Not that we should start with float.
>
> So, this whole debate is a mess, I think. But hey ho.
>
> On 2 May 2023, at 20:57, Patrick McFadin  wrote:
>
> 
> I'll speak up on that one. If you look at my ranked voting, that is where
> my head is. I get accused of scope creep (a lot) and looking at the initial
> proposal Jonathan put on the ML it was mostly "Developers are adopting
> vector search at a furious pace and I think I have a simple way of adding
> support to keep Cassandra relevant for these use cases" Instead of just
> focusing on this use case, I feel the arguments have bike shedded into
> scope creep which means it will take forever to get into the project.
>
> My preference is to see one thing validated with an MVP and get it into
> the hands of developers sooner so we can continue to iterate based on
> actual usage.
>
> It doesn't say your points are wrong or your opinions are broken, I'm
> voting for what I think will be awesome for users sooner.
>
> Patrick
>
> On Tue, May 2, 2023 at 12:29 PM Benedict  wrote:
>
>> Could folk voting against a general purpose type (that could well be
>> called a vector) briefly explain their reasoning?
>>
>> We established in the other thread that it’s technically trivial, meaning
>> folk must think it is strictly superior to only support float rather than
>> eg all numeric types (note: for the type, not the ANN).
>>
>> I am surprised, and the blurbs accompanying votes so far don’t seem to
>> touch on this, mostly just endorsing the idea of a vector.
>>
>>
>> On 2 May 2023, at 20:20, Patrick McFadin  wrote:
>>
>> 
>> A > B > C on both polls.
>>
>> Having talked to several users in the community that are highly excited
>> about this change, this gets to what developers want to do at Cassandra
>> scale: store embeddings and retrieve them.
>>
>> On Tue, May 2, 2023 at 11:47 AM Andrés de la Peña 
>> wrote:
>>
>>> A > B > C
>>>
>>> I don't think that ML is such a niche application that it 

Re: [POLL] Vector type for ML

2023-05-02 Thread David Capwell
>  How about it, David? Did you already make this?

I checked out the patch, fixed serialize/deserialize, added the constraints, 
then added a composeForFloat(ByteBuffer), with this the impact to the POC patch 
was the following

1) move away from VectorType.instance.serializer().deserialize(bb) to 
type.composeForFloat(bb), both return float[]
2) change the index validate logic to move away from checking VectorType and 
instead check for that plus the element type == FloatType.  I didn’t bother to 
do this as it's trivial
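
For readers following along, a rough sketch of what a composeForFloat(ByteBuffer) style method could look like. It assumes the vector is stored as dimension * 4 bytes of back-to-back floats in the buffer's default big-endian order; that layout, the class name FloatVectorSketch and the decompose helper are assumptions for illustration, not the contents of the patch.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of a fixed-length float vector codec; not the actual patch.
final class FloatVectorSketch {
    private final int dimension;

    FloatVectorSketch(int dimension) {
        if (dimension <= 0)
            throw new IllegalArgumentException("dimension must be positive");
        this.dimension = dimension;
    }

    // Serialize exactly `dimension` floats, 4 bytes each, with no per-element header.
    ByteBuffer decompose(float[] values) {
        if (values.length != dimension)
            throw new IllegalArgumentException(
                "expected " + dimension + " elements, got " + values.length);
        ByteBuffer bb = ByteBuffer.allocate(dimension * Float.BYTES);
        for (float v : values)
            bb.putFloat(v);
        bb.flip();
        return bb;
    }

    // Decode straight into a primitive float[] so callers avoid boxing.
    float[] composeForFloat(ByteBuffer bb) {
        if (bb.remaining() != dimension * Float.BYTES)
            throw new IllegalArgumentException(
                "expected " + (dimension * Float.BYTES) + " bytes, got " + bb.remaining());
        ByteBuffer view = bb.duplicate(); // leave the caller's position untouched
        float[] out = new float[dimension];
        for (int i = 0; i < dimension; i++)
            out[i] = view.getFloat();
        return out;
    }
}
```

The point of returning float[] directly is that index code consuming the values does not need to know whether the type is float-only or generic.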

> David. End this argument. SHOW THE CODE! 

If this argument ends and people are cool with vector supporting abstract type, 
more than glad to help get this in.

> On May 2, 2023, at 1:53 PM, Jeremy Hanna  wrote:
> 
> I'm all for bringing more functionality to the masses sooner, but the 
> original idea has a very very specific use case.  Do we have use cases for a 
> general purpose Vector/Array data structure?  If so, awesome.  I just 
> wondered if generalizing provides value, beyond being straightforward to 
> implement.  I'm just trying to be sensitive to the database code maintenance 
> and driver support for general types versus a single type for a specific, 
> well defined purpose.
> 
> If it could easily be a plugin, that's great - but the full picture involves 
> drivers that need to support it or you end up getting binary blobs you have 
> to decode client side and then do stuff with.  So ideally if you have a well 
> defined use case that you can build into the database, having it just be part 
> of the database and associated drivers - that makes the experience much much 
> better.
> 
> I'm not trying to say B couldn't be valuable or that a plugin couldn't be 
> feasible.  I'm just trying to enlarge the picture a bit to see what that 
> means for this use case and for the supporting drivers/clients.
> 
>> On May 2, 2023, at 3:04 PM, Benedict  wrote:
>> 
>> But it’s so trivial it was already implemented by David in the span of ten 
>> minutes? If anything, we’re slowing progress down by refusing to do the 
>> extra types, as we’re busy arguing about it rather than delivering a feature?
>> 
>> FWIW, my interpretation of the votes today is that we SHOULD NOT (ever) 
>> support types beyond float. Not that we should start with float.
>> 
>> So, this whole debate is a mess, I think. But hey ho.
>> 
>>> On 2 May 2023, at 20:57, Patrick McFadin  wrote:
>>> 
>>> 
>>> I'll speak up on that one. If you look at my ranked voting, that is where 
>>> my head is. I get accused of scope creep (a lot) and looking at the initial 
>>> proposal Jonathan put on the ML it was mostly "Developers are adopting 
>>> vector search at a furious pace and I think I have a simple way of adding 
>>> support to keep Cassandra relevant for these use cases" Instead of just 
>>> focusing on this use case, I feel the arguments have bike shedded into 
>>> scope creep which means it will take forever to get into the project.
>>> 
>>> My preference is to see one thing validated with an MVP and get it into the 
>>> hands of developers sooner so we can continue to iterate based on actual 
>>> usage. 
>>> 
>>> It doesn't say your points are wrong or your opinions are broken, I'm 
>>> voting for what I think will be awesome for users sooner. 
>>> 
>>> Patrick
>>> 
>>> On Tue, May 2, 2023 at 12:29 PM Benedict  wrote:
>>>> Could folk voting against a general purpose type (that could well be
>>>> called a vector) briefly explain their reasoning?
>>>>
>>>> We established in the other thread that it’s technically trivial, meaning
>>>> folk must think it is strictly superior to only support float rather than
>>>> eg all numeric types (note: for the type, not the ANN).
>>>>
>>>> I am surprised, and the blurbs accompanying votes so far don’t seem to
>>>> touch on this, mostly just endorsing the idea of a vector.
>>>>
>>>>
>>>>> On 2 May 2023, at 20:20, Patrick McFadin  wrote:
>>>>>
>>>>>
>>>>> A > B > C on both polls.
>>>>>
>>>>> Having talked to several users in the community that are highly excited
>>>>> about this change, this gets to what developers want to do at Cassandra
>>>>> scale: store embeddings and retrieve them.
>>>>>
>>>>> On Tue, May 2, 2023 at 11:47 AM Andrés de la Peña  wrote:
>>>>>> A > B > C
>>>>>>
>>>>>> I don't think that ML is such a niche application that it can't have its
>>>>>> own CQL data type. Also, vectors are mathematical elements that have
>>>>>> more applications than ML.
>>>>>>
>>>>>> On Tue, 2 May 2023 at 19:15, Mick Semb Wever  wrote:
>>>>>>>
>>>>>>>
>>>>>>> On Tue, 2 May 2023 at 17:14, Jonathan Ellis  wrote:
>>>>>>>> Should we add a vector type to Cassandra designed to meet the needs of
>>>>>>>> machine learning use cases, specifically feature and embedding vectors
>>>>>>>> for training, inference,

Re: [POLL] Vector type for ML

2023-05-02 Thread Jeremy Hanna
I'm all for bringing more functionality to the masses sooner, but the original 
idea has a very very specific use case.  Do we have use cases for a general 
purpose Vector/Array data structure?  If so, awesome.  I just wondered if 
generalizing provides value, beyond being straightforward to implement.  I'm 
just trying to be sensitive to the database code maintenance and driver support 
for general types versus a single type for a specific, well defined purpose.

If it could easily be a plugin, that's great - but the full picture involves 
drivers that need to support it or you end up getting binary blobs you have to 
decode client side and then do stuff with.  So ideally if you have a well 
defined use case that you can build into the database, having it just be part 
of the database and associated drivers - that makes the experience much much 
better.

I'm not trying to say B couldn't be valuable or that a plugin couldn't be 
feasible.  I'm just trying to enlarge the picture a bit to see what that means 
for this use case and for the supporting drivers/clients.

> On May 2, 2023, at 3:04 PM, Benedict  wrote:
> 
> But it’s so trivial it was already implemented by David in the span of ten 
> minutes? If anything, we’re slowing progress down by refusing to do the extra 
> types, as we’re busy arguing about it rather than delivering a feature?
> 
> FWIW, my interpretation of the votes today is that we SHOULD NOT (ever) 
> support types beyond float. Not that we should start with float.
> 
> So, this whole debate is a mess, I think. But hey ho.
> 
>> On 2 May 2023, at 20:57, Patrick McFadin  wrote:
>> 
>> 
>> I'll speak up on that one. If you look at my ranked voting, that is where my 
>> head is. I get accused of scope creep (a lot) and looking at the initial 
>> proposal Jonathan put on the ML it was mostly "Developers are adopting 
>> vector search at a furious pace and I think I have a simple way of adding 
>> support to keep Cassandra relevant for these use cases" Instead of just 
>> focusing on this use case, I feel the arguments have bike shedded into scope 
>> creep which means it will take forever to get into the project.
>> 
>> My preference is to see one thing validated with an MVP and get it into the 
>> hands of developers sooner so we can continue to iterate based on actual 
>> usage. 
>> 
>> It doesn't say your points are wrong or your opinions are broken, I'm voting 
>> for what I think will be awesome for users sooner. 
>> 
>> Patrick
>> 
>> On Tue, May 2, 2023 at 12:29 PM Benedict  wrote:
>>> Could folk voting against a general purpose type (that could well be called 
>>> a vector) briefly explain their reasoning?
>>> 
>>> We established in the other thread that it’s technically trivial, meaning 
>>> folk must think it is strictly superior to only support float rather than 
>>> eg all numeric types (note: for the type, not the ANN). 
>>> 
>>> I am surprised, and the blurbs accompanying votes so far don’t seem to 
>>> touch on this, mostly just endorsing the idea of a vector.
>>> 
>>> 
>>>> On 2 May 2023, at 20:20, Patrick McFadin  wrote:
>>>>
>>>>
>>>> A > B > C on both polls.
>>>>
>>>> Having talked to several users in the community that are highly excited
>>>> about this change, this gets to what developers want to do at Cassandra
>>>> scale: store embeddings and retrieve them.
>>>>
>>>> On Tue, May 2, 2023 at 11:47 AM Andrés de la Peña  wrote:
>>>>> A > B > C
>>>>>
>>>>> I don't think that ML is such a niche application that it can't have its
>>>>> own CQL data type. Also, vectors are mathematical elements that have more
>>>>> applications than ML.
>>>>>
>>>>> On Tue, 2 May 2023 at 19:15, Mick Semb Wever  wrote:
>>>>>>
>>>>>>
>>>>>> On Tue, 2 May 2023 at 17:14, Jonathan Ellis  wrote:
>>>>>>> Should we add a vector type to Cassandra designed to meet the needs of
>>>>>>> machine learning use cases, specifically feature and embedding vectors
>>>>>>> for training, inference, and vector search?
>>>>>>>
>>>>>>> ML vectors are fixed-dimension (fixed-length) sequences of numeric
>>>>>>> types, with no nulls allowed, and with no need for random access. The
>>>>>>> ML industry overwhelmingly uses float32 vectors, to the point that the
>>>>>>> industry-leading special-purpose vector database ONLY supports that
>>>>>>> data type.
>>>>>>>
>>>>>>> This poll is to gauge consensus subsequent to the recent discussion
>>>>>>> thread at
>>>>>>> https://lists.apache.org/thread/0lj1nk9jbhkf1rlgqcvxqzfyntdjrnk0.
>>>>>>>
>>>>>>> Please rank the discussed options from most preferred option to least,
>>>>>>> e.g., A > B > C (A is my preference, followed by B, followed by C) or
>>>>>>> C > B = A (C is my preference, followed by B or A approximately equally.)
>>>>>>>
>>>>>>> (A) I am in favor of adding a vector type for

Re: [POLL] Vector type for ML

2023-05-02 Thread Patrick McFadin
Yeah, it's a bit of a mess but mailing list yo. People reading this would
have no idea we are friends. ;) (Which we are, for anyone reading this
later!)

I must have missed the point of this already being done. How about it,
David? Did you already make this?

"FWIW, my interpretation of the votes today is that we SHOULD NOT (ever)
support types beyond float. Not that we should start with float"
That is not my interpretation and I can definitely see how that may be
frustrating. If B is pretty much done then we are good. My concern, as
noted earlier, is the scope creep component that will delay this happening
for much longer.

David. End this argument. SHOW THE CODE!

Patrick


On Tue, May 2, 2023 at 1:04 PM Benedict  wrote:

> But it’s so trivial it was already implemented by David in the span of ten
> minutes? If anything, we’re slowing progress down by refusing to do the
> extra types, as we’re busy arguing about it rather than delivering a
> feature?
>
> FWIW, my interpretation of the votes today is that we SHOULD NOT (ever)
> support types beyond float. Not that we should start with float.
>
> So, this whole debate is a mess, I think. But hey ho.
>
> On 2 May 2023, at 20:57, Patrick McFadin  wrote:
>
> 
> I'll speak up on that one. If you look at my ranked voting, that is where
> my head is. I get accused of scope creep (a lot) and looking at the initial
> proposal Jonathan put on the ML it was mostly "Developers are adopting
> vector search at a furious pace and I think I have a simple way of adding
> support to keep Cassandra relevant for these use cases" Instead of just
> focusing on this use case, I feel the arguments have bike shedded into
> scope creep which means it will take forever to get into the project.
>
> My preference is to see one thing validated with an MVP and get it into
> the hands of developers sooner so we can continue to iterate based on
> actual usage.
>
> It doesn't say your points are wrong or your opinions are broken, I'm
> voting for what I think will be awesome for users sooner.
>
> Patrick
>
> On Tue, May 2, 2023 at 12:29 PM Benedict  wrote:
>
>> Could folk voting against a general purpose type (that could well be
>> called a vector) briefly explain their reasoning?
>>
>> We established in the other thread that it’s technically trivial, meaning
>> folk must think it is strictly superior to only support float rather than
>> eg all numeric types (note: for the type, not the ANN).
>>
>> I am surprised, and the blurbs accompanying votes so far don’t seem to
>> touch on this, mostly just endorsing the idea of a vector.
>>
>>
>> On 2 May 2023, at 20:20, Patrick McFadin  wrote:
>>
>> 
>> A > B > C on both polls.
>>
>> Having talked to several users in the community that are highly excited
>> about this change, this gets to what developers want to do at Cassandra
>> scale: store embeddings and retrieve them.
>>
>> On Tue, May 2, 2023 at 11:47 AM Andrés de la Peña 
>> wrote:
>>
>>> A > B > C
>>>
>>> I don't think that ML is such a niche application that it can't have its
>>> own CQL data type. Also, vectors are mathematical elements that have more
>>> applications than ML.
>>>
>>> On Tue, 2 May 2023 at 19:15, Mick Semb Wever  wrote:
>>>
>>>>
>>>>
>>>> On Tue, 2 May 2023 at 17:14, Jonathan Ellis  wrote:
>>>>
>>>>> Should we add a vector type to Cassandra designed to meet the needs of
>>>>> machine learning use cases, specifically feature and embedding vectors for
>>>>> training, inference, and vector search?
>>>>>
>>>>> ML vectors are fixed-dimension (fixed-length) sequences of numeric
>>>>> types, with no nulls allowed, and with no need for random access. The ML
>>>>> industry overwhelmingly uses float32 vectors, to the point that the
>>>>> industry-leading special-purpose vector database ONLY supports that data
>>>>> type.
>>>>>
>>>>> This poll is to gauge consensus subsequent to the recent discussion
>>>>> thread at
>>>>> https://lists.apache.org/thread/0lj1nk9jbhkf1rlgqcvxqzfyntdjrnk0.
>>>>>
>>>>> Please rank the discussed options from most preferred option to least,
>>>>> e.g., A > B > C (A is my preference, followed by B, followed by C) or
>>>>> C > B = A (C is my preference, followed by B or A approximately equally.)
>>>>>
>>>>> (A) I am in favor of adding a vector type for floats; I do not believe
>>>>> we need to tie it to any particular implementation details.
>>>>>
>>>>> (B) I am okay with adding a vector type but I believe we must add
>>>>> array types that compose with all Cassandra types first, and make vectors
>>>>> a special case of arrays-without-null-elements.
>>>>>
>>>>> (C) I am not in favor of adding a built-in vector type.
>>>>>
>>>>
>>>>
>>>>
>>>> A  > B > C
>>>>
>>>> B is stated as "must add array types…".  I think this is a bit loaded.
>>>> If B was the (A + the implementation needs to be a non-null frozen float32
>>>> array, serialisation forward compatible with other frozen arrays later
>>>> implemented) I would put this before (A).

Re: [POLL] Vector type for ML

2023-05-02 Thread Benedict
But it’s so trivial it was already implemented by David in the span of ten
minutes? If anything, we’re slowing progress down by refusing to do the
extra types, as we’re busy arguing about it rather than delivering a
feature?

FWIW, my interpretation of the votes today is that we SHOULD NOT (ever)
support types beyond float. Not that we should start with float.

So, this whole debate is a mess, I think. But hey ho.

On 2 May 2023, at 20:57, Patrick McFadin  wrote:

> I'll speak up on that one. If you look at my ranked voting, that is where
> my head is. I get accused of scope creep (a lot) and looking at the initial
> proposal Jonathan put on the ML it was mostly "Developers are adopting
> vector search at a furious pace and I think I have a simple way of adding
> support to keep Cassandra relevant for these use cases" Instead of just
> focusing on this use case, I feel the arguments have bike shedded into
> scope creep which means it will take forever to get into the project.
>
> My preference is to see one thing validated with an MVP and get it into
> the hands of developers sooner so we can continue to iterate based on
> actual usage.
>
> It doesn't say your points are wrong or your opinions are broken, I'm
> voting for what I think will be awesome for users sooner.
>
> Patrick
>
> On Tue, May 2, 2023 at 12:29 PM Benedict  wrote:
>
>> Could folk voting against a general purpose type (that could well be
>> called a vector) briefly explain their reasoning?
>>
>> We established in the other thread that it’s technically trivial, meaning
>> folk must think it is strictly superior to only support float rather than
>> eg all numeric types (note: for the type, not the ANN).
>>
>> I am surprised, and the blurbs accompanying votes so far don’t seem to
>> touch on this, mostly just endorsing the idea of a vector.
>>
>> On 2 May 2023, at 20:20, Patrick McFadin  wrote:
>>
>>> A > B > C on both polls.
>>>
>>> Having talked to several users in the community that are highly excited
>>> about this change, this gets to what developers want to do at Cassandra
>>> scale: store embeddings and retrieve them.
>>>
>>> On Tue, May 2, 2023 at 11:47 AM Andrés de la Peña  wrote:
>>>
>>>> A > B > C
>>>>
>>>> I don't think that ML is such a niche application that it can't have its
>>>> own CQL data type. Also, vectors are mathematical elements that have more
>>>> applications than ML.
>>>>
>>>> On Tue, 2 May 2023 at 19:15, Mick Semb Wever  wrote:
>>>>
>>>>> On Tue, 2 May 2023 at 17:14, Jonathan Ellis  wrote:
>>>>>
>>>>>> Should we add a vector type to Cassandra designed to meet the needs of
>>>>>> machine learning use cases, specifically feature and embedding vectors
>>>>>> for training, inference, and vector search?
>>>>>>
>>>>>> ML vectors are fixed-dimension (fixed-length) sequences of numeric
>>>>>> types, with no nulls allowed, and with no need for random access. The ML
>>>>>> industry overwhelmingly uses float32 vectors, to the point that the
>>>>>> industry-leading special-purpose vector database ONLY supports that data
>>>>>> type.
>>>>>>
>>>>>> This poll is to gauge consensus subsequent to the recent discussion
>>>>>> thread at
>>>>>> https://lists.apache.org/thread/0lj1nk9jbhkf1rlgqcvxqzfyntdjrnk0.
>>>>>>
>>>>>> Please rank the discussed options from most preferred option to least,
>>>>>> e.g., A > B > C (A is my preference, followed by B, followed by C) or
>>>>>> C > B = A (C is my preference, followed by B or A approximately equally.)
>>>>>>
>>>>>> (A) I am in favor of adding a vector type for floats; I do not believe
>>>>>> we need to tie it to any particular implementation details.
>>>>>>
>>>>>> (B) I am okay with adding a vector type but I believe we must add array
>>>>>> types that compose with all Cassandra types first, and make vectors a
>>>>>> special case of arrays-without-null-elements.
>>>>>>
>>>>>> (C) I am not in favor of adding a built-in vector type.
>>>>>
>>>>> A  > B > C
>>>>>
>>>>> B is stated as "must add array types…".  I think this is a bit loaded.
>>>>> If B was the (A + the implementation needs to be a non-null frozen float32
>>>>> array, serialisation forward compatible with other frozen arrays later
>>>>> implemented) I would put this before (A).  Especially because it's been
>>>>> shown already this is easy to implement.





Re: [POLL] Vector type for ML

2023-05-02 Thread Patrick McFadin
I'll speak up on that one. If you look at my ranked voting, that is where
my head is. I get accused of scope creep (a lot) and looking at the initial
proposal Jonathan put on the ML it was mostly "Developers are adopting
vector search at a furious pace and I think I have a simple way of adding
support to keep Cassandra relevant for these use cases" Instead of just
focusing on this use case, I feel the arguments have bike shedded into
scope creep which means it will take forever to get into the project.

My preference is to see one thing validated with an MVP and get it into the
hands of developers sooner so we can continue to iterate based on actual
usage.

It doesn't say your points are wrong or your opinions are broken, I'm
voting for what I think will be awesome for users sooner.

Patrick

On Tue, May 2, 2023 at 12:29 PM Benedict  wrote:

> Could folk voting against a general purpose type (that could well be
> called a vector) briefly explain their reasoning?
>
> We established in the other thread that it’s technically trivial, meaning
> folk must think it is strictly superior to only support float rather than
> eg all numeric types (note: for the type, not the ANN).
>
> I am surprised, and the blurbs accompanying votes so far don’t seem to
> touch on this, mostly just endorsing the idea of a vector.
>
>
> On 2 May 2023, at 20:20, Patrick McFadin  wrote:
>
> 
> A > B > C on both polls.
>
> Having talked to several users in the community that are highly excited
> about this change, this gets to what developers want to do at Cassandra
> scale: store embeddings and retrieve them.
>
> On Tue, May 2, 2023 at 11:47 AM Andrés de la Peña 
> wrote:
>
>> A > B > C
>>
>> I don't think that ML is such a niche application that it can't have its
>> own CQL data type. Also, vectors are mathematical elements that have more
>> applications than ML.
>>
>> On Tue, 2 May 2023 at 19:15, Mick Semb Wever  wrote:
>>
>>>
>>>
>>> On Tue, 2 May 2023 at 17:14, Jonathan Ellis  wrote:
>>>
>>>> Should we add a vector type to Cassandra designed to meet the needs of
>>>> machine learning use cases, specifically feature and embedding vectors for
>>>> training, inference, and vector search?
>>>>
>>>> ML vectors are fixed-dimension (fixed-length) sequences of numeric
>>>> types, with no nulls allowed, and with no need for random access. The ML
>>>> industry overwhelmingly uses float32 vectors, to the point that the
>>>> industry-leading special-purpose vector database ONLY supports that data
>>>> type.
>>>>
>>>> This poll is to gauge consensus subsequent to the recent discussion
>>>> thread at
>>>> https://lists.apache.org/thread/0lj1nk9jbhkf1rlgqcvxqzfyntdjrnk0.
>>>>
>>>> Please rank the discussed options from most preferred option to least,
>>>> e.g., A > B > C (A is my preference, followed by B, followed by C) or C > B
>>>> = A (C is my preference, followed by B or A approximately equally.)
>>>>
>>>> (A) I am in favor of adding a vector type for floats; I do not believe
>>>> we need to tie it to any particular implementation details.
>>>>
>>>> (B) I am okay with adding a vector type but I believe we must add array
>>>> types that compose with all Cassandra types first, and make vectors a
>>>> special case of arrays-without-null-elements.
>>>>
>>>> (C) I am not in favor of adding a built-in vector type.
>>>>
>>>
>>>
>>>
>>> A  > B > C
>>>
>>> B is stated as "must add array types…".  I think this is a bit loaded.
>>> If B was the (A + the implementation needs to be a non-null frozen float32
>>> array, serialisation forward compatible with other frozen arrays later
>>> implemented) I would put this before (A).  Especially because it's been
>>> shown already this is easy to implement.
>>>
>>>
>>>
>>


Re: [POLL] Vector type for ML

2023-05-02 Thread Benedict
Could folk voting against a general purpose type (that could well be
called a vector) briefly explain their reasoning?

We established in the other thread that it’s technically trivial, meaning
folk must think it is strictly superior to only support float rather than
eg all numeric types (note: for the type, not the ANN).

I am surprised, and the blurbs accompanying votes so far don’t seem to
touch on this, mostly just endorsing the idea of a vector.

On 2 May 2023, at 20:20, Patrick McFadin  wrote:

> A > B > C on both polls.
>
> Having talked to several users in the community that are highly excited
> about this change, this gets to what developers want to do at Cassandra
> scale: store embeddings and retrieve them.
>
> On Tue, May 2, 2023 at 11:47 AM Andrés de la Peña  wrote:
>
>> A > B > C
>>
>> I don't think that ML is such a niche application that it can't have its
>> own CQL data type. Also, vectors are mathematical elements that have more
>> applications than ML.
>>
>> On Tue, 2 May 2023 at 19:15, Mick Semb Wever  wrote:
>>
>>> On Tue, 2 May 2023 at 17:14, Jonathan Ellis  wrote:
>>>
>>>> Should we add a vector type to Cassandra designed to meet the needs of
>>>> machine learning use cases, specifically feature and embedding vectors for
>>>> training, inference, and vector search?
>>>>
>>>> ML vectors are fixed-dimension (fixed-length) sequences of numeric
>>>> types, with no nulls allowed, and with no need for random access. The ML
>>>> industry overwhelmingly uses float32 vectors, to the point that the
>>>> industry-leading special-purpose vector database ONLY supports that data
>>>> type.
>>>>
>>>> This poll is to gauge consensus subsequent to the recent discussion
>>>> thread at
>>>> https://lists.apache.org/thread/0lj1nk9jbhkf1rlgqcvxqzfyntdjrnk0.
>>>>
>>>> Please rank the discussed options from most preferred option to least,
>>>> e.g., A > B > C (A is my preference, followed by B, followed by C) or C > B
>>>> = A (C is my preference, followed by B or A approximately equally.)
>>>>
>>>> (A) I am in favor of adding a vector type for floats; I do not believe
>>>> we need to tie it to any particular implementation details.
>>>>
>>>> (B) I am okay with adding a vector type but I believe we must add array
>>>> types that compose with all Cassandra types first, and make vectors a
>>>> special case of arrays-without-null-elements.
>>>>
>>>> (C) I am not in favor of adding a built-in vector type.
>>>>
>>>
>>> A  > B > C
>>>
>>> B is stated as "must add array types…".  I think this is a bit loaded.
>>> If B was the (A + the implementation needs to be a non-null frozen float32
>>> array, serialisation forward compatible with other frozen arrays later
>>> implemented) I would put this before (A).  Especially because it's been
>>> shown already this is easy to implement.




Re: [POLL] Vector type for ML

2023-05-02 Thread Patrick McFadin
A > B > C on both polls.

Having talked to several users in the community that are highly excited
about this change, this gets to what developers want to do at Cassandra
scale: store embeddings and retrieve them.

On Tue, May 2, 2023 at 11:47 AM Andrés de la Peña 
wrote:

> A > B > C
>
> I don't think that ML is such a niche application that it can't have its
> own CQL data type. Also, vectors are mathematical elements that have more
> applications than ML.
>
> On Tue, 2 May 2023 at 19:15, Mick Semb Wever  wrote:
>
>>
>>
>> On Tue, 2 May 2023 at 17:14, Jonathan Ellis  wrote:
>>
>>> Should we add a vector type to Cassandra designed to meet the needs of
>>> machine learning use cases, specifically feature and embedding vectors for
>>> training, inference, and vector search?
>>>
>>> ML vectors are fixed-dimension (fixed-length) sequences of numeric
>>> types, with no nulls allowed, and with no need for random access. The ML
>>> industry overwhelmingly uses float32 vectors, to the point that the
>>> industry-leading special-purpose vector database ONLY supports that data
>>> type.
>>>
>>> This poll is to gauge consensus subsequent to the recent discussion
>>> thread at
>>> https://lists.apache.org/thread/0lj1nk9jbhkf1rlgqcvxqzfyntdjrnk0.
>>>
>>> Please rank the discussed options from most preferred option to least,
>>> e.g., A > B > C (A is my preference, followed by B, followed by C) or C > B
>>> = A (C is my preference, followed by B or A approximately equally.)
>>>
>>> (A) I am in favor of adding a vector type for floats; I do not believe
>>> we need to tie it to any particular implementation details.
>>>
>>> (B) I am okay with adding a vector type but I believe we must add array
>>> types that compose with all Cassandra types first, and make vectors a
>>> special case of arrays-without-null-elements.
>>>
>>> (C) I am not in favor of adding a built-in vector type.
>>>
>>
>>
>>
>> A  > B > C
>>
>> B is stated as "must add array types…".  I think this is a bit loaded.
>> If B was the (A + the implementation needs to be a non-null frozen float32
>> array, serialisation forward compatible with other frozen arrays later
>> implemented) I would put this before (A).  Especially because it's been
>> shown already this is easy to implement.
>>
>>
>>
>


Re: [POLL] Vector type for ML

2023-05-02 Thread Andrés de la Peña
A > B > C

I don't think that ML is such a niche application that it can't have its
own CQL data type. Also, vectors are mathematical elements that have more
applications than ML.

On Tue, 2 May 2023 at 19:15, Mick Semb Wever  wrote:

>
>
> On Tue, 2 May 2023 at 17:14, Jonathan Ellis  wrote:
>
>> Should we add a vector type to Cassandra designed to meet the needs of
>> machine learning use cases, specifically feature and embedding vectors for
>> training, inference, and vector search?
>>
>> ML vectors are fixed-dimension (fixed-length) sequences of numeric types,
>> with no nulls allowed, and with no need for random access. The ML industry
>> overwhelmingly uses float32 vectors, to the point that the industry-leading
>> special-purpose vector database ONLY supports that data type.
>>
>> This poll is to gauge consensus subsequent to the recent discussion
>> thread at
>> https://lists.apache.org/thread/0lj1nk9jbhkf1rlgqcvxqzfyntdjrnk0.
>>
>> Please rank the discussed options from most preferred option to least,
>> e.g., A > B > C (A is my preference, followed by B, followed by C) or C > B
>> = A (C is my preference, followed by B or A approximately equally.)
>>
>> (A) I am in favor of adding a vector type for floats; I do not believe we
>> need to tie it to any particular implementation details.
>>
>> (B) I am okay with adding a vector type but I believe we must add array
>> types that compose with all Cassandra types first, and make vectors a
>> special case of arrays-without-null-elements.
>>
>> (C) I am not in favor of adding a built-in vector type.
>>
>
>
>
> A  > B > C
>
> B is stated as "must add array types…".  I think this is a bit loaded.  If
> B was the (A + the implementation needs to be a non-null frozen float32
> array, serialisation forward compatible with other frozen arrays later
> implemented) I would put this before (A).  Especially because it's been
> shown already this is easy to implement.
>
>
>


Re: [POLL] Vector type for ML

2023-05-02 Thread Mick Semb Wever
On Tue, 2 May 2023 at 17:14, Jonathan Ellis  wrote:

> Should we add a vector type to Cassandra designed to meet the needs of
> machine learning use cases, specifically feature and embedding vectors for
> training, inference, and vector search?
>
> ML vectors are fixed-dimension (fixed-length) sequences of numeric types,
> with no nulls allowed, and with no need for random access. The ML industry
> overwhelmingly uses float32 vectors, to the point that the industry-leading
> special-purpose vector database ONLY supports that data type.
>
> This poll is to gauge consensus subsequent to the recent discussion thread
> at https://lists.apache.org/thread/0lj1nk9jbhkf1rlgqcvxqzfyntdjrnk0.
>
> Please rank the discussed options from most preferred option to least,
> e.g., A > B > C (A is my preference, followed by B, followed by C) or C > B
> = A (C is my preference, followed by B or A approximately equally.)
>
> (A) I am in favor of adding a vector type for floats; I do not believe we
> need to tie it to any particular implementation details.
>
> (B) I am okay with adding a vector type but I believe we must add array
> types that compose with all Cassandra types first, and make vectors a
> special case of arrays-without-null-elements.
>
> (C) I am not in favor of adding a built-in vector type.
>



A  > B > C

B is stated as "must add array types…".  I think this is a bit loaded.  If
B was the (A + the implementation needs to be a non-null frozen float32
array, serialisation forward compatible with other frozen arrays later
implemented) I would put this before (A).  Especially because it's been
shown already this is easy to implement.


Re: [POLL] Vector type for ML

2023-05-02 Thread David Capwell
> B) Should we introduce a type that is general purpose, and supports all 
> Cassandra types, so that this may be used to support ML (and perhaps other) 
> workloads

I vote B only as well...

> On May 2, 2023, at 9:02 AM, Benedict  wrote:
> 
> This is not the poll I thought we would be conducting, and I don’t really 
> support its framing. There are two parallel questions: what the functionality 
> should be and how they should be exposed. This poll compresses the 
> optionality poorly.
> 
> Whether or not we support a “vector” concept (or something isomorphic with 
> it), the first question this poll wants to answer is:
> 
> A) Should we introduce a new CQL collection type that is unique to ML and 
> *only* supports float32
> B) Should we introduce a type that is general purpose, and supports all 
> Cassandra types, so that this may be used to support ML (and perhaps other) 
> workloads
> C) Should we not introduce new types to CQL at all
> 
> For this question, I vote B only.
> 
> Once this question is answered it makes sense to answer how it will be 
> exposed semantically/syntactically. 
> 
> 
>> On 2 May 2023, at 16:43, Jonathan Ellis  wrote:
>> 
>> 
>> My preference: A > B > C.  Vectors are distinct enough from arrays that we 
>> should not make adding the latter a prerequisite for adding the former.
>> 
>> On Tue, May 2, 2023 at 10:13 AM Jonathan Ellis  wrote:
>>> Should we add a vector type to Cassandra designed to meet the needs of 
>>> machine learning use cases, specifically feature and embedding vectors for 
>>> training, inference, and vector search?  
>>> 
>>> ML vectors are fixed-dimension (fixed-length) sequences of numeric types, 
>>> with no nulls allowed, and with no need for random access. The ML industry 
>>> overwhelmingly uses float32 vectors, to the point that the industry-leading 
>>> special-purpose vector database ONLY supports that data type.
>>> 
>>> This poll is to gauge consensus subsequent to the recent discussion thread 
>>> at https://lists.apache.org/thread/0lj1nk9jbhkf1rlgqcvxqzfyntdjrnk0.
>>> 
>>> Please rank the discussed options from most preferred option to least, 
>>> e.g., A > B > C (A is my preference, followed by B, followed by C) or C > B 
>>> = A (C is my preference, followed by B or A approximately equally.)
>>> 
>>> (A) I am in favor of adding a vector type for floats; I do not believe we 
>>> need to tie it to any particular implementation details.
>>> 
>>> (B) I am okay with adding a vector type but I believe we must add array 
>>> types that compose with all Cassandra types first, and make vectors a 
>>> special case of arrays-without-null-elements.
>>> 
>>> (C) I am not in favor of adding a built-in vector type.
>>> 
>>> -- 
>>> Jonathan Ellis
>>> co-founder, http://www.datastax.com 
>>> @spyced
>> 
>> 
>> -- 
>> Jonathan Ellis
>> co-founder, http://www.datastax.com 
>> @spyced



Re: [POLL] Vector type for ML

2023-05-02 Thread Benedict
This is not the poll I thought we would be conducting, and I don’t really
support its framing. There are two parallel questions: what the
functionality should be and how they should be exposed. This poll
compresses the optionality poorly.

Whether or not we support a “vector” concept (or something isomorphic with
it), the first question this poll wants to answer is:

A) Should we introduce a new CQL collection type that is unique to ML and
*only* supports float32
B) Should we introduce a type that is general purpose, and supports all
Cassandra types, so that this may be used to support ML (and perhaps other)
workloads
C) Should we not introduce new types to CQL at all

For this question, I vote B only.

Once this question is answered it makes sense to answer how it will be
exposed semantically/syntactically.

On 2 May 2023, at 16:43, Jonathan Ellis  wrote:

> My preference: A > B > C.  Vectors are distinct enough from arrays that we
> should not make adding the latter a prerequisite for adding the former.
>
> On Tue, May 2, 2023 at 10:13 AM Jonathan Ellis  wrote:
>
>> Should we add a vector type to Cassandra designed to meet the needs of
>> machine learning use cases, specifically feature and embedding vectors for
>> training, inference, and vector search?
>>
>> ML vectors are fixed-dimension (fixed-length) sequences of numeric types,
>> with no nulls allowed, and with no need for random access. The ML industry
>> overwhelmingly uses float32 vectors, to the point that the industry-leading
>> special-purpose vector database ONLY supports that data type.
>>
>> This poll is to gauge consensus subsequent to the recent discussion thread
>> at https://lists.apache.org/thread/0lj1nk9jbhkf1rlgqcvxqzfyntdjrnk0.
>>
>> Please rank the discussed options from most preferred option to least,
>> e.g., A > B > C (A is my preference, followed by B, followed by C) or C > B
>> = A (C is my preference, followed by B or A approximately equally.)
>>
>> (A) I am in favor of adding a vector type for floats; I do not believe we
>> need to tie it to any particular implementation details.
>>
>> (B) I am okay with adding a vector type but I believe we must add array
>> types that compose with all Cassandra types first, and make vectors a
>> special case of arrays-without-null-elements.
>>
>> (C) I am not in favor of adding a built-in vector type.
>>
>> --
>> Jonathan Ellis
>> co-founder, http://www.datastax.com
>> @spyced
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced


Re: [POLL] Vector type for ML

2023-05-02 Thread Jonathan Ellis
My preference: A > B > C.  Vectors are distinct enough from arrays that we
should not make adding the latter a prerequisite for adding the former.

On Tue, May 2, 2023 at 10:13 AM Jonathan Ellis  wrote:

> Should we add a vector type to Cassandra designed to meet the needs of
> machine learning use cases, specifically feature and embedding vectors for
> training, inference, and vector search?
>
> ML vectors are fixed-dimension (fixed-length) sequences of numeric types,
> with no nulls allowed, and with no need for random access. The ML industry
> overwhelmingly uses float32 vectors, to the point that the industry-leading
> special-purpose vector database ONLY supports that data type.
>
> This poll is to gauge consensus subsequent to the recent discussion thread
> at https://lists.apache.org/thread/0lj1nk9jbhkf1rlgqcvxqzfyntdjrnk0.
>
> Please rank the discussed options from most preferred option to least,
> e.g., A > B > C (A is my preference, followed by B, followed by C) or C > B
> = A (C is my preference, followed by B or A approximately equally.)
>
> (A) I am in favor of adding a vector type for floats; I do not believe we
> need to tie it to any particular implementation details.
>
> (B) I am okay with adding a vector type but I believe we must add array
> types that compose with all Cassandra types first, and make vectors a
> special case of arrays-without-null-elements.
>
> (C) I am not in favor of adding a built-in vector type.
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>


-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced