Re: [DISCUSS] New data type for vector search

2023-05-02 Thread Benedict
If we agree we’re delivering some general purpose array type, that supports
all types as elements (ie, is logically equivalent to a frozen list of fixed
length, however it is actually implemented), I think we are in technical
agreement and it’s just a matter of presentation.

At which point I think we should simply collect the possible syntax options
and put them to a poll. I’m not keen on vector for previously stated reasons,
but it’s probably not worth litigating further and we should let the silent
majority adjudicate.

On 2 May 2023, at 12:43, Jonathan Ellis wrote:

> To make sure I understand correctly -- are you saying that you're fine with
> a vector type, but you want to see it implemented as a special case of
> arrays, or that you are not fine with a vector type because you would
> prefer to only add arrays and that should be "good enough" for ML?
>
> On Mon, May 1, 2023 at 4:27 PM Benedict wrote:
>
> A data type plug-in is actually really easy today, I think? But,
> developing further hooks should probably be thought through as they’re
> necessary.
>
> I think in this case it would be simpler to deliver a general purpose
> type, which is why I’m trying to propose types that would be acceptable.
>
> I also think we’re pretty close to agreement, really?
>
> But if not, let’s flesh out potential plug-in requirements.
>
> On 1 May 2023, at 21:58, Josh McKenzie wrote:
>
> > If we want to make an ML-specific data type, it should be in an ML plug-in.
>
> How can we encourage a healthier plug-in ecosystem? As far as I know it's
> been pretty anemic historically:
>
> cassandra:
> https://cassandra.apache.org/doc/latest/cassandra/plugins/index.html
> postgres: https://www.postgresql.org/docs/current/contrib.html
>
> I'm really interested to hear if there's more in the ecosystem I'm not
> aware of or if there's been strides made in this regard; users in the
> ecosystem being able to write durable extensions to Cassandra that they can
> then distribute and gain momentum could potentially be a great incubator
> for new features or functionality in the ecosystem.
>
> If our support for extensions remains as bare as I believe it to be, I
> wouldn't recommend anyone go that route.
>
> On Mon, May 1, 2023, at 4:17 PM, Benedict wrote:
>
> I have explained repeatedly why I am opposed to ML-specific data types. If
> we want to make an ML-specific data type, it should be in an ML plug-in. We
> should not pollute the general purpose language with hastily-considered
> features that target specific bandwagons - at best partially - no matter
> how exciting the bandwagon.
>
> I think a simple and easy case can be made for fixed length array types
> that do not seem to create random bits of cruft in the language that dangle
> by themselves should this play not pan out. This is an easy way for this
> effort to make progress without negatively impacting the language.
>
> That is, unless we want to start supporting totally random types for every
> use case at the top level language layer. I don’t think this is a good
> idea, personally, and I’m quite confident we would now be regretting this
> approach had it been taken for earlier bandwagons.
>
> Nor do I think anyone’s priors about how successful this effort will be
> should matter. As a matter of principle, we should simply never deliver a
> specialist functionality as a high level CQL language feature without at
> least baking it for several years as a plug-in.
>
> On 1 May 2023, at 21:03, Mick Semb Wever wrote:
>
> Yes!  What you (David) and Benedict write beautifully supports `VECTOR
> FLOAT[n]` imho.
>
> You are definitely bringing up valid implementation details, and that can
> be dealt with during patch review. This thread is about the CQL API
> addition.
>
> No matter which way the technical review goes with the implementation
> details, `VECTOR FLOAT[n]` does not limit it, and gives us the most ML
> idiomatic approach and the best long-term CQL API.  It's a win-win
> situation – no matter how you look at it imho it is the best solution api
> wise.
>
> Unless the suggestion is that an ideal implementation can give us a better
> CQL API – but I don't see what that could be.   Maybe the suggestion is we
> deny the possibility of using the VECTOR keyword and bring us back to
> something like `NON-NULL FROZEN`.   This is odd to me because
> `VECTOR` here can be just an alias for `NON-NULL FROZEN` while meeting the
> patch's audience and their idioms.  I have no problems with introducing
> such an alias to meet the ML crowd.
>
> Another way I think of this is
>  `VECTOR FLOAT[n]` is the porcelain ML cql api,
>  `NON-NULL FROZEN` and `FROZEN` and `FLOAT[n]` are the
> general-use plumbing cql apis.
>
> This would allow implementation details to be moved out of this thread and
> to the review phase.
>
> On Mon, 1 May 2023 at 20:57, David Capwell wrote:
>
> > I think it is totally reasonable that the ANN patch (and Jonathan) is
> > not asked to implement on top of, or towards, other array (or other) new
> > data types.
>
> This impacts serialization, if you do not think about this day 1 you then
> can’t add later on

Re: [DISCUSS] New data type for vector search

2023-05-02 Thread Jonathan Ellis
To make sure I understand correctly -- are you saying that you're fine with
a vector type, but you want to see it implemented as a special case of
arrays, or that you are not fine with a vector type because you would
prefer to only add arrays and that should be "good enough" for ML?

On Mon, May 1, 2023 at 4:27 PM Benedict  wrote:

> A data type plug-in is actually really easy today, I think? But,
> developing further hooks should probably be thought through as they’re
> necessary.
>
> I think in this case it would be simpler to deliver a general purpose
> type, which is why I’m trying to propose types that would be acceptable.
>
> I also think we’re pretty close to agreement, really?
>
> But if not, let’s flesh out potential plug-in requirements.
>
>
> On 1 May 2023, at 21:58, Josh McKenzie  wrote:
>
> 
>
> If we want to make an ML-specific data type, it should be in an ML plug-in.
>
> How can we encourage a healthier plug-in ecosystem? As far as I know it's
> been pretty anemic historically:
>
> cassandra:
> https://cassandra.apache.org/doc/latest/cassandra/plugins/index.html
> postgres: https://www.postgresql.org/docs/current/contrib.html
>
> I'm really interested to hear if there's more in the ecosystem I'm not
> aware of or if there's been strides made in this regard; users in the
> ecosystem being able to write durable extensions to Cassandra that they can
> then distribute and gain momentum could potentially be a great incubator
> for new features or functionality in the ecosystem.
>
> If our support for extensions remains as bare as I believe it to be, I
> wouldn't recommend anyone go that route.
>
> On Mon, May 1, 2023, at 4:17 PM, Benedict wrote:
>
>
> I have explained repeatedly why I am opposed to ML-specific data types. If
> we want to make an ML-specific data type, it should be in an ML plug-in. We
> should not pollute the general purpose language with hastily-considered
> features that target specific bandwagons - at best partially - no matter
> how exciting the bandwagon.
>
> I think a simple and easy case can be made for fixed length array types
> that do not seem to create random bits of cruft in the language that dangle
> by themselves should this play not pan out. This is an easy way for this
> effort to make progress without negatively impacting the language.
>
> That is, unless we want to start supporting totally random types for every
> use case at the top level language layer. I don’t think this is a good
> idea, personally, and I’m quite confident we would now be regretting this
> approach had it been taken for earlier bandwagons.
>
> Nor do I think anyone’s priors about how successful this effort will be
> should matter. As a matter of principle, we should simply never deliver a
> specialist functionality as a high level CQL language feature without at
> least baking it for several years as a plug-in.
>
> On 1 May 2023, at 21:03, Mick Semb Wever  wrote:
>
> 
>
> Yes!  What you (David) and Benedict write beautifully supports `VECTOR
> FLOAT[n]` imho.
>
> You are definitely bringing up valid implementation details, and that can
> be dealt with during patch review. This thread is about the CQL API
> addition.
>
> No matter which way the technical review goes with the implementation
> details, `VECTOR FLOAT[n]` does not limit it, and gives us the most ML
> idiomatic approach and the best long-term CQL API.  It's a win-win
> situation – no matter how you look at it imho it is the best solution api
> wise.
>
> Unless the suggestion is that an ideal implementation can give us a better
> CQL API – but I don't see what that could be.   Maybe the suggestion is we
> deny the possibility of using the VECTOR keyword and bring us back to
> something like `NON-NULL FROZEN`.   This is odd to me because
> `VECTOR` here can be just an alias for `NON-NULL FROZEN` while meeting the
> patch's audience and their idioms.  I have no problems with introducing
> such an alias to meet the ML crowd.
>
> Another way I think of this is
>  `VECTOR FLOAT[n]` is the porcelain ML cql api,
>  `NON-NULL FROZEN` and `FROZEN` and `FLOAT[n]` are the
> general-use plumbing cql apis.
>
> This would allow implementation details to be moved out of this thread and
> to the review phase.
>
>
>
>
> On Mon, 1 May 2023 at 20:57, David Capwell  wrote:
>
> > I think it is totally reasonable that the ANN patch (and Jonathan) is
> not asked to implement on top of, or towards, other array (or other) new
> data types.
>
>
> This impacts serialization, if you do not think about this day 1 you then
> can’t add later on without having to worry about migration and versioning…
>
> Honestly I wanted to better understand the cost to be generic and the
> impact to ANN, so I took
> https://github.com/jbellis/cassandra/blob/vsearch/src/java/org/apache/cassandra/db/marshal/VectorType.java
> and made it handle every requirement I have listed so far (size, null, all
> types)… the current patch has several bugs at the type level that would
> 

Re: [DISCUSS] New data type for vector search

2023-05-02 Thread Mick Semb Wever
I have no problem with `VECTOR` hanging around forever as an alias for
`NON-NULL FROZEN`.  Even without ANN, it makes sense and will stick with
new C* users.

A plug-in system would be great, but it shouldn't hold back this work imho.



On Mon, 1 May 2023 at 22:17, Benedict  wrote:

> I have explained repeatedly why I am opposed to ML-specific data types. If
> we want to make an ML-specific data type, it should be in an ML plug-in. We
> should not pollute the general purpose language with hastily-considered
> features that target specific bandwagons - at best partially - no matter
> how exciting the bandwagon.
>
> I think a simple and easy case can be made for fixed length array types
> that do not seem to create random bits of cruft in the language that dangle
> by themselves should this play not pan out. This is an easy way for this
> effort to make progress without negatively impacting the language.
>
> That is, unless we want to start supporting totally random types for every
> use case at the top level language layer. I don’t think this is a good
> idea, personally, and I’m quite confident we would now be regretting this
> approach had it been taken for earlier bandwagons.
>
> Nor do I think anyone’s priors about how successful this effort will be
> should matter. As a matter of principle, we should simply never deliver a
> specialist functionality as a high level CQL language feature without at
> least baking it for several years as a plug-in.
>
> On 1 May 2023, at 21:03, Mick Semb Wever  wrote:
>
> 
>
> Yes!  What you (David) and Benedict write beautifully supports `VECTOR
> FLOAT[n]` imho.
>
> You are definitely bringing up valid implementation details, and that can
> be dealt with during patch review. This thread is about the CQL API
> addition.
>
> No matter which way the technical review goes with the implementation
> details, `VECTOR FLOAT[n]` does not limit it, and gives us the most ML
> idiomatic approach and the best long-term CQL API.  It's a win-win
> situation – no matter how you look at it imho it is the best solution api
> wise.
>
> Unless the suggestion is that an ideal implementation can give us a better
> CQL API – but I don't see what that could be.   Maybe the suggestion is we
> deny the possibility of using the VECTOR keyword and bring us back to
> something like `NON-NULL FROZEN`.   This is odd to me because
> `VECTOR` here can be just an alias for `NON-NULL FROZEN` while meeting the
> patch's audience and their idioms.  I have no problems with introducing
> such an alias to meet the ML crowd.
>
> Another way I think of this is
>  `VECTOR FLOAT[n]` is the porcelain ML cql api,
>  `NON-NULL FROZEN` and `FROZEN` and `FLOAT[n]` are the
> general-use plumbing cql apis.
>
> This would allow implementation details to be moved out of this thread and
> to the review phase.
>
>
>
>
> On Mon, 1 May 2023 at 20:57, David Capwell  wrote:
>
>> > I think it is totally reasonable that the ANN patch (and Jonathan) is
>> not asked to implement on top of, or towards, other array (or other) new
>> data types.
>>
>>
>> This impacts serialization, if you do not think about this day 1 you then
>> can’t add later on without having to worry about migration and versioning…
>>
>> Honestly I wanted to better understand the cost to be generic and the
>> impact to ANN, so I took
>> https://github.com/jbellis/cassandra/blob/vsearch/src/java/org/apache/cassandra/db/marshal/VectorType.java
>> and made it handle every requirement I have listed so far (size, null, all
>> types)… the current patch has several bugs at the type level that would
>> need to be fixed, so had to fix those as well…. Total time to do this was
>> 10 minutes… and this includes adding a method "public float[]
>> composeAsFloats(ByteBuffer bytes)” which made the change to existing logic
>> small (change VectorType.Serializer.instance.deserialize(buffer) to
>> type.composeAsFloats(buffer))….
>>
>> Did this have any impact to the final ByteBuffer?  Nope, it had identical
>> layout for the FloatType case, but works for all types…. I didn’t change
>> the fact we store the size (felt this could be removed, but then we could
>> never support expanding the vector in the future…)
>>
>> So, given the fact it takes a few minutes to implement all these
>> requirements, I do find it very reasonable to push back and say we should
>> make sure the new type is not leaking details from a special ANN index…. We
>> have spent more time debating this than it takes to support… we also have
>> fuzz testing on trunk so just updating
>> org.apache.cassandra.utils.AbstractTypeGenerators to know about this new
>> type means we get type coverage as well…
>>
>> I have zero issues helping to review this patch and make sure the testing
>> is on-par with existing types (this is a strong requirement for me)
>>
>>
>> > On May 1, 2023, at 10:40 AM, Mick Semb Wever  wrote:
>> >
>> >
>> > > But suggesting that Jonathan should work on implementing general
>> purpose arrays 
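
A minimal sketch of the float path David describes in the quoted message
above, assuming a serialized layout of a 4-byte element count followed by the
raw floats. The class name `FixedLengthFloatVector` and the exact byte layout
are assumptions made for illustration and are not the code in the vsearch
branch; only the method name `composeAsFloats(ByteBuffer bytes)` comes from
the thread. A generic version would delegate each element to its own
serializer instead of calling `putFloat`/`getFloat` directly.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Hypothetical fixed-length float vector codec; not the vsearch implementation. */
final class FixedLengthFloatVector {
    private final int dimension;

    FixedLengthFloatVector(int dimension) {
        if (dimension <= 0) {
            throw new IllegalArgumentException("dimension must be positive");
        }
        this.dimension = dimension;
    }

    /** Serializes as [int size][float 0]...[float n-1]; this layout is an assumption. */
    ByteBuffer decompose(float[] values) {
        if (values == null) {
            throw new IllegalArgumentException("null vectors are not allowed");
        }
        if (values.length != dimension) {
            throw new IllegalArgumentException(
                "expected " + dimension + " elements, got " + values.length);
        }
        ByteBuffer out = ByteBuffer.allocate(Integer.BYTES + dimension * Float.BYTES);
        out.putInt(dimension);
        for (float v : values) {
            out.putFloat(v);
        }
        out.flip();
        return out;
    }

    /** Reads the vector back as a primitive array without moving the caller's position. */
    float[] composeAsFloats(ByteBuffer bytes) {
        ByteBuffer in = bytes.duplicate();
        int size = in.getInt();
        if (size != dimension) {
            throw new IllegalArgumentException(
                "expected " + dimension + " elements, got " + size);
        }
        if (in.remaining() != size * Float.BYTES) {
            throw new IllegalArgumentException("unexpected payload length");
        }
        float[] result = new float[size];
        for (int i = 0; i < size; i++) {
            result[i] = in.getFloat();
        }
        return result;
    }

    public static void main(String[] args) {
        FixedLengthFloatVector type = new FixedLengthFloatVector(3);
        ByteBuffer serialized = type.decompose(new float[] {1.0f, 2.5f, -3.0f});
        System.out.println(Arrays.toString(type.composeAsFloats(serialized)));
    }
}
```

Keeping the stored size in the payload, as David notes, costs four bytes per
value in this sketch but leaves room to validate or evolve the dimension later.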

Re: [DISCUSS] New data type for vector search

2023-05-01 Thread J. D. Jordan
Yes. Plugging in a new type server side is very easy. Adding that type to
every client is not.

Cassandra already supports plugging in custom types through a jar.  What a
given client does when encountering a custom type it doesn’t know about
depends on the client.

I was recently looking at this for DynamicCompositeType, which is a type
shipped in C* but exposed through the custom type machinery.  Very few
drivers implement support for it. I saw one driver just crash upon
encountering it in the schema at startup. One crashed if you included such a
column in a query. And one threw warnings and treated it as a binary blob.

So as David said, the client side is per driver. Also I would recommend
thinking about using the existing custom type stuff if possible, so that we
don’t have to roll the native protocol version to add new type enums. Even
though unknown custom types act differently in each driver, they do mostly
allow someone to plug in an implementation for them.

On May 1, 2023, at 5:12 PM, David Capwell wrote:

> > A data type plug-in is actually really easy today, I think?
>
> Sadly not, the client reads the class from our schema tables and has to
> have duplicate logic to serialize/deserialize results… types are easy to
> add if you are ok with client not understanding them (and will some clients
> fail due to every language having its own logic?)
>
> On May 1, 2023, at 2:26 PM, Benedict wrote:
>
> A data type plug-in is actually really easy today, I think? But,
> developing further hooks should probably be thought through as they’re
> necessary.
>
> I think in this case it would be simpler to deliver a general purpose
> type, which is why I’m trying to propose types that would be acceptable.
>
> I also think we’re pretty close to agreement, really?
>
> But if not, let’s flesh out potential plug-in requirements.
>
> On 1 May 2023, at 21:58, Josh McKenzie wrote:
>
> > If we want to make an ML-specific data type, it should be in an ML plug-in.
>
> How can we encourage a healthier plug-in ecosystem? As far as I know it's
> been pretty anemic historically:
>
> cassandra:
> https://cassandra.apache.org/doc/latest/cassandra/plugins/index.html
> postgres: https://www.postgresql.org/docs/current/contrib.html
>
> I'm really interested to hear if there's more in the ecosystem I'm not
> aware of or if there's been strides made in this regard; users in the
> ecosystem being able to write durable extensions to Cassandra that they can
> then distribute and gain momentum could potentially be a great incubator
> for new features or functionality in the ecosystem.
>
> If our support for extensions remains as bare as I believe it to be, I
> wouldn't recommend anyone go that route.
>
> On Mon, May 1, 2023, at 4:17 PM, Benedict wrote:
>
> I have explained repeatedly why I am opposed to ML-specific data types. If
> we want to make an ML-specific data type, it should be in an ML plug-in. We
> should not pollute the general purpose language with hastily-considered
> features that target specific bandwagons - at best partially - no matter
> how exciting the bandwagon.
>
> I think a simple and easy case can be made for fixed length array types
> that do not seem to create random bits of cruft in the language that dangle
> by themselves should this play not pan out. This is an easy way for this
> effort to make progress without negatively impacting the language.
>
> That is, unless we want to start supporting totally random types for every
> use case at the top level language layer. I don’t think this is a good
> idea, personally, and I’m quite confident we would now be regretting this
> approach had it been taken for earlier bandwagons.
>
> Nor do I think anyone’s priors about how successful this effort will be
> should matter. As a matter of principle, we should simply never deliver a
> specialist functionality as a high level CQL language feature without at
> least baking it for several years as a plug-in.
>
> On 1 May 2023, at 21:03, Mick Semb Wever wrote:
>
> Yes!  What you (David) and Benedict write beautifully supports `VECTOR
> FLOAT[n]` imho.
>
> You are definitely bringing up valid implementation details, and that can
> be dealt with during patch review. This thread is about the CQL API
> addition.
>
> No matter which way the technical review goes with the implementation
> details, `VECTOR FLOAT[n]` does not limit it, and gives us the most ML
> idiomatic approach and the best long-term CQL API.  It's a win-win
> situation – no matter how you look at it imho it is the best solution api
> wise.
>
> Unless the suggestion is that an ideal implementation can give us a better
> CQL API – but I don't see what that could be.   Maybe the suggestion is we
> deny the possibility of using the VECTOR keyword and bring us back to
> something like `NON-NULL FROZEN`.   This is odd to me because
> `VECTOR` here can be just an alias for `NON-NULL FROZEN` while meeting the
> patch's audience and their idioms.  I have no problems with introducing
> such an alias to meet the ML crowd.
>
> Another way I think of this is
>  `VECTOR FLOAT[n]` is the porcelain ML cql api,
>  `NON-NULL FROZEN` and `FROZEN` and `FLOAT[n]` are the

Re: [DISCUSS] New data type for vector search

2023-05-01 Thread David Capwell
> A data type plug-in is actually really easy today, I think? 

Sadly not, the client reads the class from our schema tables and has to have 
duplicate logic to serialize/deserialize results… types are easy to add if you 
are ok with client not understanding them (and will some clients fail due to 
every language having its own logic?)
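
To make the client-side cost concrete, here is a rough sketch, assuming a
driver that looks up a decoder by the server-side class name it read from the
schema tables. `ClientTypeRegistry`, its methods, and the class name
`com.example.UnknownType` are hypothetical and do not correspond to any real
driver API. The blob fallback shown is only one of several possible
behaviours for a type the client does not know.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical client-side registry keyed by the server's type class name. */
final class ClientTypeRegistry {
    private final Map<String, Function<ByteBuffer, Object>> decoders = new HashMap<>();

    /** Each client has to re-implement the decoding logic for every server type. */
    void register(String serverClassName, Function<ByteBuffer, Object> decoder) {
        decoders.put(serverClassName, decoder);
    }

    /** Decodes a value, or hands back the raw bytes when the type is unknown. */
    Object decode(String serverClassName, ByteBuffer raw) {
        Function<ByteBuffer, Object> decoder = decoders.get(serverClassName);
        if (decoder == null) {
            // Unknown custom type: this sketch treats it as an opaque blob.
            return raw.asReadOnlyBuffer();
        }
        // Decode from a duplicate so the caller's buffer position is untouched.
        return decoder.apply(raw.duplicate());
    }

    public static void main(String[] args) {
        ClientTypeRegistry registry = new ClientTypeRegistry();
        registry.register("org.apache.cassandra.db.marshal.FloatType",
                          buffer -> Float.valueOf(buffer.getFloat()));

        ByteBuffer value = ByteBuffer.allocate(Float.BYTES);
        value.putFloat(1.5f);
        value.flip();

        System.out.println(registry.decode("org.apache.cassandra.db.marshal.FloatType", value));
        System.out.println(registry.decode("com.example.UnknownType", value));
    }
}
```

A new server-side type only becomes usable once every driver registers an
equivalent decoder, which is the duplicated logic described above.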

> On May 1, 2023, at 2:26 PM, Benedict  wrote:
> 
> A data type plug-in is actually really easy today, I think? But, developing 
> further hooks should probably be thought through as they’re necessary. 
> 
> I think in this case it would be simpler to deliver a general purpose type, 
> which is why I’m trying to propose types that would be acceptable.
> 
> I also think we’re pretty close to agreement, really?
> 
> But if not, let’s flesh out potential plug-in requirements.
> 
> 
>> On 1 May 2023, at 21:58, Josh McKenzie  wrote:
>> 
>> 
>>> 
>>> If we want to make an ML-specific data type, it should be in an ML plug-in.
>> How can we encourage a healthier plug-in ecosystem? As far as I know it's 
>> been pretty anemic historically:
>> 
>> cassandra: 
>> https://cassandra.apache.org/doc/latest/cassandra/plugins/index.html
>> postgres: https://www.postgresql.org/docs/current/contrib.html
>> 
>> I'm really interested to hear if there's more in the ecosystem I'm not aware 
>> of or if there's been strides made in this regard; users in the ecosystem 
>> being able to write durable extensions to Cassandra that they can then 
>> distribute and gain momentum could potentially be a great incubator for new 
>> features or functionality in the ecosystem.
>> 
>> If our support for extensions remains as bare as I believe it to be, I 
>> wouldn't recommend anyone go that route.
>> 
>> On Mon, May 1, 2023, at 4:17 PM, Benedict wrote:
>>> 
>>> I have explained repeatedly why I am opposed to ML-specific data types. If 
>>> we want to make an ML-specific data type, it should be in an ML plug-in. We 
>>> should not pollute the general purpose language with hastily-considered 
>>> features that target specific bandwagons - at best partially - no matter 
>>> how exciting the bandwagon.
>>> 
>>> I think a simple and easy case can be made for fixed length array types 
>>> that do not seem to create random bits of cruft in the language that dangle 
>>> by themselves should this play not pan out. This is an easy way for this 
>>> effort to make progress without negatively impacting the language.
>>> 
>>> That is, unless we want to start supporting totally random types for every 
>>> use case at the top level language layer. I don’t think this is a good 
>>> idea, personally, and I’m quite confident we would now be regretting this 
>>> approach had it been taken for earlier bandwagons.
>>> 
>>> Nor do I think anyone’s priors about how successful this effort will be 
>>> should matter. As a matter of principle, we should simply never deliver a 
>>> specialist functionality as a high level CQL language feature without at 
>>> least baking it for several years as a plug-in.
>>> 
>>>> On 1 May 2023, at 21:03, Mick Semb Wever wrote:
>>>>
>>>>
>>>> Yes!  What you (David) and Benedict write beautifully supports `VECTOR
>>>> FLOAT[n]` imho.
>>>>
>>>> You are definitely bringing up valid implementation details, and that can
>>>> be dealt with during patch review. This thread is about the CQL API
>>>> addition.
>>>>
>>>> No matter which way the technical review goes with the implementation
>>>> details, `VECTOR FLOAT[n]` does not limit it, and gives us the most ML
>>>> idiomatic approach and the best long-term CQL API.  It's a win-win
>>>> situation – no matter how you look at it imho it is the best solution api
>>>> wise.
>>>>
>>>> Unless the suggestion is that an ideal implementation can give us a better
>>>> CQL API – but I don't see what that could be.   Maybe the suggestion is we
>>>> deny the possibility of using the VECTOR keyword and bring us back to
>>>> something like `NON-NULL FROZEN`.   This is odd to me because
>>>> `VECTOR` here can be just an alias for `NON-NULL FROZEN` while meeting the
>>>> patch's audience and their idioms.  I have no problems with introducing
>>>> such an alias to meet the ML crowd.
>>>>
>>>> Another way I think of this is
>>>>  `VECTOR FLOAT[n]` is the porcelain ML cql api,
>>>>  `NON-NULL FROZEN` and `FROZEN` and `FLOAT[n]` are the
>>>> general-use plumbing cql apis.
>>>>
>>>> This would allow implementation details to be moved out of this thread and
>>>> to the review phase.
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, 1 May 2023 at 20:57, David Capwell wrote:
>>>>> > I think it is totally reasonable that the ANN patch (and Jonathan) is
>>>>> > not asked to implement on top of, or towards, other array (or other) new
>>>>> > data types.
>>>>>
>>>>>
>>>>> This impacts serialization, if you do not think about this day 1 you then
>>>>> can’t add later on without having to worry about migration and versioning…
>>>>>
 

Re: [DISCUSS] New data type for vector search

2023-05-01 Thread Benedict
A data type plug-in is actually really easy today, I think? But, developing
further hooks should probably be thought through as they’re necessary.

I think in this case it would be simpler to deliver a general purpose type,
which is why I’m trying to propose types that would be acceptable.

I also think we’re pretty close to agreement, really?

But if not, let’s flesh out potential plug-in requirements.

On 1 May 2023, at 21:58, Josh McKenzie wrote:

> > If we want to make an ML-specific data type, it should be in an ML plug-in.
>
> How can we encourage a healthier plug-in ecosystem? As far as I know it's
> been pretty anemic historically:
>
> cassandra:
> https://cassandra.apache.org/doc/latest/cassandra/plugins/index.html
> postgres: https://www.postgresql.org/docs/current/contrib.html
>
> I'm really interested to hear if there's more in the ecosystem I'm not
> aware of or if there's been strides made in this regard; users in the
> ecosystem being able to write durable extensions to Cassandra that they can
> then distribute and gain momentum could potentially be a great incubator
> for new features or functionality in the ecosystem.
>
> If our support for extensions remains as bare as I believe it to be, I
> wouldn't recommend anyone go that route.
>
> On Mon, May 1, 2023, at 4:17 PM, Benedict wrote:
>
> I have explained repeatedly why I am opposed to ML-specific data types. If
> we want to make an ML-specific data type, it should be in an ML plug-in. We
> should not pollute the general purpose language with hastily-considered
> features that target specific bandwagons - at best partially - no matter
> how exciting the bandwagon.
>
> I think a simple and easy case can be made for fixed length array types
> that do not seem to create random bits of cruft in the language that dangle
> by themselves should this play not pan out. This is an easy way for this
> effort to make progress without negatively impacting the language.
>
> That is, unless we want to start supporting totally random types for every
> use case at the top level language layer. I don’t think this is a good
> idea, personally, and I’m quite confident we would now be regretting this
> approach had it been taken for earlier bandwagons.
>
> Nor do I think anyone’s priors about how successful this effort will be
> should matter. As a matter of principle, we should simply never deliver a
> specialist functionality as a high level CQL language feature without at
> least baking it for several years as a plug-in.
>
> On 1 May 2023, at 21:03, Mick Semb Wever wrote:
>
> Yes!  What you (David) and Benedict write beautifully supports `VECTOR
> FLOAT[n]` imho.
>
> You are definitely bringing up valid implementation details, and that can
> be dealt with during patch review. This thread is about the CQL API
> addition.
>
> No matter which way the technical review goes with the implementation
> details, `VECTOR FLOAT[n]` does not limit it, and gives us the most ML
> idiomatic approach and the best long-term CQL API.  It's a win-win
> situation – no matter how you look at it imho it is the best solution api
> wise.
>
> Unless the suggestion is that an ideal implementation can give us a better
> CQL API – but I don't see what that could be.   Maybe the suggestion is we
> deny the possibility of using the VECTOR keyword and bring us back to
> something like `NON-NULL FROZEN`.   This is odd to me because
> `VECTOR` here can be just an alias for `NON-NULL FROZEN` while meeting the
> patch's audience and their idioms.  I have no problems with introducing
> such an alias to meet the ML crowd.
>
> Another way I think of this is
>  `VECTOR FLOAT[n]` is the porcelain ML cql api,
>  `NON-NULL FROZEN` and `FROZEN` and `FLOAT[n]` are the
> general-use plumbing cql apis.
>
> This would allow implementation details to be moved out of this thread and
> to the review phase.
>
> On Mon, 1 May 2023 at 20:57, David Capwell wrote:
>
> > I think it is totally reasonable that the ANN patch (and Jonathan) is
> > not asked to implement on top of, or towards, other array (or other) new
> > data types.
>
> This impacts serialization, if you do not think about this day 1 you then
> can’t add later on without having to worry about migration and versioning…
>
> Honestly I wanted to better understand the cost to be generic and the
> impact to ANN, so I took
> https://github.com/jbellis/cassandra/blob/vsearch/src/java/org/apache/cassandra/db/marshal/VectorType.java
> and made it handle every requirement I have listed so far (size, null, all
> types)… the current patch has several bugs at the type level that would
> need to be fixed, so had to fix those as well…. Total time to do this was
> 10 minutes… and this includes adding a method "public float[]
> composeAsFloats(ByteBuffer bytes)" which made the change to existing logic
> small (change VectorType.Serializer.instance.deserialize(buffer) to
> type.composeAsFloats(buffer))….
>
> Did this have any impact to the final ByteBuffer?  Nope, it had identical
> layout for the FloatType case, but works for all types…. I didn’t change
> the fact we store the size (felt this could be removed, but then we could
> never support

Re: [DISCUSS] New data type for vector search

2023-05-01 Thread Josh McKenzie
> If we want to make an ML-specific data type, it should be in an ML plug-in.
How can we encourage a healthier plug-in ecosystem? As far as I know it's been 
pretty anemic historically:

cassandra: https://cassandra.apache.org/doc/latest/cassandra/plugins/index.html
postgres: https://www.postgresql.org/docs/current/contrib.html

I'm really interested to hear if there's more in the ecosystem I'm not aware of 
or if there's been strides made in this regard; users in the ecosystem being 
able to write durable extensions to Cassandra that they can then distribute and 
gain momentum could potentially be a great incubator for new features or 
functionality in the ecosystem.

If our support for extensions remains as bare as I believe it to be, I wouldn't 
recommend anyone go that route.

On Mon, May 1, 2023, at 4:17 PM, Benedict wrote:
> 
> I have explained repeatedly why I am opposed to ML-specific data types. If we 
> want to make an ML-specific data type, it should be in an ML plug-in. We 
> should not pollute the general purpose language with hastily-considered 
> features that target specific bandwagons - at best partially - no matter how 
> exciting the bandwagon.
> 
> I think a simple and easy case can be made for fixed length array types that 
> do not seem to create random bits of cruft in the language that dangle by 
> themselves should this play not pan out. This is an easy way for this effort 
> to make progress without negatively impacting the language.
> 
> That is, unless we want to start supporting totally random types for every 
> use case at the top level language layer. I don’t think this is a good idea, 
> personally, and I’m quite confident we would now be regretting this approach 
> had it been taken for earlier bandwagons.
> 
> Nor do I think anyone’s priors about how successful this effort will be 
> should matter. As a matter of principle, we should simply never deliver a 
> specialist functionality as a high level CQL language feature without at 
> least baking it for several years as a plug-in.
> 
>> On 1 May 2023, at 21:03, Mick Semb Wever  wrote:
>> 
>> 
>> Yes!  What you (David) and Benedict write beautifully supports `VECTOR 
>> FLOAT[n]` imho.
>> 
>> You are definitely bringing up valid implementation details, and that can be 
>> dealt with during patch review. This thread is about the CQL API addition.  
>> 
>> No matter which way the technical review goes with the implementation 
>> details, `VECTOR FLOAT[n]` does not limit it, and gives us the most ML 
>> idiomatic approach and the best long-term CQL API.  It's a win-win situation 
>> – no matter how you look at it imho it is the best solution api wise.  
>> 
>> Unless the suggestion is that an ideal implementation can give us a better 
>> CQL API – but I don't see what that could be.   Maybe the suggestion is we 
>> deny the possibility of using the VECTOR keyword and bring us back to 
>> something like `NON-NULL FROZEN`.   This is odd to me because 
>> `VECTOR` here can be just an alias for `NON-NULL FROZEN` while meeting the 
>> patch's audience and their idioms.  I have no problems with introducing such 
>> an alias to meet the ML crowd.
>> 
>> Another way I think of this is
>>  `VECTOR FLOAT[n]` is the porcelain ML cql api,
>>  `NON-NULL FROZEN` and `FROZEN` and `FLOAT[n]` are the 
>> general-use plumbing cql apis. 
>> 
>> This would allow implementation details to be moved out of this thread and 
>> to the review phase.
>> 
>> 
>> 
>> 
>> On Mon, 1 May 2023 at 20:57, David Capwell  wrote:
>>> > I think it is totally reasonable that the ANN patch (and Jonathan) is not 
>>> > asked to implement on top of, or towards, other array (or other) new data 
>>> > types.
>>> 
>>> 
>>> This impacts serialization, if you do not think about this day 1 you then 
>>> can’t add later on without having to worry about migration and versioning… 
>>> 
>>> Honestly I wanted to better understand the cost to be generic and the 
>>> impact to ANN, so I took 
>>> https://github.com/jbellis/cassandra/blob/vsearch/src/java/org/apache/cassandra/db/marshal/VectorType.java
>>>  and made it handle every requirement I have listed so far (size, null, all 
>>> types)… the current patch has several bugs at the type level that would 
>>> need to be fixed, so had to fix those as well…. Total time to do this was 
>>> 10 minutes… and this includes adding a method "public float[] 
>>> composeAsFloats(ByteBuffer bytes)” which made the change to existing logic 
>>> small (change VectorType.Serializer.instance.deserialize(buffer) to 
>>> type.composeAsFloats(buffer))….
>>> 
>>> Did this have any impact to the final ByteBuffer?  Nope, it had identical 
>>> layout for the FloatType case, but works for all types…. I didn’t change 
>>> the fact we store the size (felt this could be removed, but then we could 
>>> never support expanding the vector in the future…)
>>> 
>>> So, given the fact it takes a few minutes to implement all these 
>>> requirements, I do find it 

Re: [DISCUSS] New data type for vector search

2023-05-01 Thread David Capwell
> I think a simple and easy case can be made for fixed length array types that 
> do not seem to create random bits of cruft in the language that dangle by 
> themselves should this play not pan out. 

If I am understanding you correctly, then a “VECTOR FLOAT[n]” is fine, as it's an 
array type with 2 new properties: fixed length, NON NULL elements… for 
optimization reasons we can always have a special composeForFloat(ByteBuffer) 
-> float[] which can be used by ANN (avoids boxing and java.util.List)

This makes sure the type doesn’t care about ML and can be used for other use 
cases, but any ML systems that come after the fact can special-case 
this type… this is basically the change I made locally to test out the effort
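
A minimal sketch of what such a float-specific accessor could look like, for illustration only. The class name `FloatVectorCodec`, the `decompose` helper, and the layout (consecutive 4-byte floats, with the dimension supplied by the type rather than read from the buffer) are assumptions made here; they are not taken from the actual patch, which may store a size as well.

```java
import java.nio.ByteBuffer;

// Hypothetical helper, not the real VectorType: shows a ByteBuffer -> float[]
// path that avoids boxing and java.util.List.
public final class FloatVectorCodec {
    private FloatVectorCodec() {}

    // Assumed layout: `values.length` consecutive 4-byte floats, no length prefix.
    public static ByteBuffer decompose(float[] values) {
        ByteBuffer out = ByteBuffer.allocate(values.length * Float.BYTES);
        for (float v : values) {
            out.putFloat(v);
        }
        out.flip();
        return out;
    }

    // Reads `dimension` floats using absolute gets, so the caller's buffer
    // position is left untouched.
    public static float[] composeAsFloats(ByteBuffer bytes, int dimension) {
        if (bytes.remaining() != dimension * Float.BYTES) {
            throw new IllegalArgumentException(
                "expected " + (dimension * Float.BYTES) + " bytes, got " + bytes.remaining());
        }
        float[] result = new float[dimension];
        int base = bytes.position();
        for (int i = 0; i < dimension; i++) {
            result[i] = bytes.getFloat(base + i * Float.BYTES);
        }
        return result;
    }
}
```

An ANN index could call `composeAsFloats` directly, while a general-purpose type would still expose its usual generic deserialization for other element types.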

> On May 1, 2023, at 1:17 PM, Benedict  wrote:
> 
> I have explained repeatedly why I am opposed to ML-specific data types. If we 
> want to make an ML-specific data type, it should be in an ML plug-in. We 
> should not pollute the general purpose language with hastily-considered 
> features that target specific bandwagons - at best partially - no matter how 
> exciting the bandwagon.
> 
> I think a simple and easy case can be made for fixed length array types that 
> do not seem to create random bits of cruft in the language that dangle by 
> themselves should this play not pan out. This is an easy way for this effort 
> to make progress without negatively impacting the language.
> 
> That is, unless we want to start supporting totally random types for every 
> use case at the top level language layer. I don’t think this is a good idea, 
> personally, and I’m quite confident we would now be regretting this approach 
> had it been taken for earlier bandwagons.
> 
> Nor do I think anyone’s priors about how successful this effort will be 
> should matter. As a matter of principle, we should simply never deliver a 
> specialist functionality as a high level CQL language feature without at 
> least baking it for several years as a plug-in.
> 
>> On 1 May 2023, at 21:03, Mick Semb Wever  wrote:
>> 
>> 
>> 
>> Yes!  What you (David) and Benedict write beautifully supports `VECTOR 
>> FLOAT[n]` imho.
>> 
>> You are definitely bringing up valid implementation details, and that can be 
>> dealt with during patch review. This thread is about the CQL API addition.  
>> 
>> No matter which way the technical review goes with the implementation 
>> details, `VECTOR FLOAT[n]` does not limit it, and gives us the most ML 
>> idiomatic approach and the best long-term CQL API.  It's a win-win situation 
>> – no matter how you look at it imho it is the best solution api wise.  
>> 
>> Unless the suggestion is that an ideal implementation can give us a better 
>> CQL API – but I don't see what that could be.   Maybe the suggestion is we 
>> deny the possibility of using the VECTOR keyword and bring us back to 
>> something like `NON-NULL FROZEN`.   This is odd to me because 
>> `VECTOR` here can be just an alias for `NON-NULL FROZEN` while meeting the 
>> patch's audience and their idioms.  I have no problems with introducing such 
>> an alias to meet the ML crowd.
>> 
>> Another way I think of this is
>>  `VECTOR FLOAT[n]` is the porcelain ML cql api,
>>  `NON-NULL FROZEN` and `FROZEN` and `FLOAT[n]` are the 
>> general-use plumbing cql apis. 
>> 
>> This would allow implementation details to be moved out of this thread and 
>> to the review phase.
>> 
>> 
>> 
>> 
>> On Mon, 1 May 2023 at 20:57, David Capwell wrote:
>>> > I think it is totally reasonable that the ANN patch (and Jonathan) is not 
>>> > asked to implement on top of, or towards, other array (or other) new data 
>>> > types.
>>> 
>>> 
>>> This impacts serialization, if you do not think about this day 1 you then 
>>> can’t add later on without having to worry about migration and versioning… 
>>> 
>>> Honestly I wanted to better understand the cost to be generic and the 
>>> impact to ANN, so I took 
>>> https://github.com/jbellis/cassandra/blob/vsearch/src/java/org/apache/cassandra/db/marshal/VectorType.java
>>>  and made it handle every requirement I have listed so far (size, null, all 
>>> types)… the current patch has several bugs at the type level that would 
>>> need to be fixed, so had to fix those as well…. Total time to do this was 
>>> 10 minutes… and this includes adding a method "public float[] 
>>> composeAsFloats(ByteBuffer bytes)” which made the change to existing logic 
>>> small (change VectorType.Serializer.instance.deserialize(buffer) to 
>>> type.composeAsFloats(buffer))….
>>> 
>>> Did this have any impact to the final ByteBuffer?  Nope, it had identical 
>>> layout for the FloatType case, but works for all types…. I didn’t change 
>>> the fact we store the size (felt this could be removed, but then we could 
>>> never support expanding the vector in the future…)
>>> 
>>> So, given the fact it takes a few minutes to implement all these 
>>> requirements, I do find it very reasonable to 

Re: [DISCUSS] New data type for vector search

2023-05-01 Thread Benedict
I have explained repeatedly why I am opposed to ML-specific data types. If we want to make an ML-specific data type, it should be in an ML plug-in. We should not pollute the general purpose language with hastily-considered features that target specific bandwagons - at best partially - no matter how exciting the bandwagon.

I think a simple and easy case can be made for fixed length array types that do not seem to create random bits of cruft in the language that dangle by themselves should this play not pan out. This is an easy way for this effort to make progress without negatively impacting the language.

That is, unless we want to start supporting totally random types for every use case at the top level language layer. I don’t think this is a good idea, personally, and I’m quite confident we would now be regretting this approach had it been taken for earlier bandwagons.

Nor do I think anyone’s priors about how successful this effort will be should matter. As a matter of principle, we should simply never deliver a specialist functionality as a high level CQL language feature without at least baking it for several years as a plug-in.

On 1 May 2023, at 21:03, Mick Semb Wever wrote:

Yes!  What you (David) and Benedict write beautifully supports `VECTOR FLOAT[n]` imho.

You are definitely bringing up valid implementation details, and that can be dealt with during patch review. This thread is about the CQL API addition.

No matter which way the technical review goes with the implementation details, `VECTOR FLOAT[n]` does not limit it, and gives us the most ML idiomatic approach and the best long-term CQL API.  It's a win-win situation – no matter how you look at it imho it is the best solution api wise.

Unless the suggestion is that an ideal implementation can give us a better CQL API – but I don't see what that could be.   Maybe the suggestion is we deny the possibility of using the VECTOR keyword and bring us back to something like `NON-NULL FROZEN`.   This is odd to me because `VECTOR` here can be just an alias for `NON-NULL FROZEN` while meeting the patch's audience and their idioms.  I have no problems with introducing such an alias to meet the ML crowd.

Another way I think of this is
 `VECTOR FLOAT[n]` is the porcelain ML cql api,
 `NON-NULL FROZEN` and `FROZEN` and `FLOAT[n]` are the general-use plumbing cql apis.

This would allow implementation details to be moved out of this thread and to the review phase.

On Mon, 1 May 2023 at 20:57, David Capwell wrote:

> I think it is totally reasonable that the ANN patch (and Jonathan) is not asked to implement on top of, or towards, other array (or other) new data types.


This impacts serialization, if you do not think about this day 1 you then can’t add later on without having to worry about migration and versioning… 

Honestly I wanted to better understand the cost to be generic and the impact to ANN, so I took https://github.com/jbellis/cassandra/blob/vsearch/src/java/org/apache/cassandra/db/marshal/VectorType.java and made it handle every requirement I have listed so far (size, null, all types)… the current patch has several bugs at the type level that would need to be fixed, so had to fix those as well…. Total time to do this was 10 minutes… and this includes adding a method "public float[] composeAsFloats(ByteBuffer bytes)” which made the change to existing logic small (change VectorType.Serializer.instance.deserialize(buffer) to type.composeAsFloats(buffer))….

Did this have any impact to the final ByteBuffer?  Nope, it had identical layout for the FloatType case, but works for all types…. I didn’t change the fact we store the size (felt this could be removed, but then we could never support expanding the vector in the future…)

So, given the fact it takes a few minutes to implement all these requirements, I do find it very reasonable to push back and say we should make sure the new type is not leaking details from a special ANN index…. We have spent more time debating this than it takes to support… we also have fuzz testing on trunk so just updating org.apache.cassandra.utils.AbstractTypeGenerators to know about this new type means we get type coverage as well…

I have zero issues helping to review this patch and make sure the testing is on-par with existing types (this is a strong requirement for me)


> On May 1, 2023, at 10:40 AM, Mick Semb Wever  wrote:
> 
> 
> > But suggesting that Jonathan should work on implementing general purpose arrays seems to fall outside the scope of this discussion, since the result of such work wouldn't even fill the need Jonathan is targeting for here. 
> 
> Every comment I have made so far I have argued that the v1 work doesn’t need to do some things, but that the limitations proposed so far are not real requirements; there is a big difference between what “could be allowed” and what is implemented day one… I am pushing back on what “could be allowed”, so far every justification has been that 

Re: [DISCUSS] New data type for vector search

2023-05-01 Thread Mick Semb Wever
Yes!  What you (David) and Benedict write beautifully supports `VECTOR
FLOAT[n]` imho.

You are definitely bringing up valid implementation details, and that can
be dealt with during patch review. This thread is about the CQL API
addition.

No matter which way the technical review goes with the implementation
details, `VECTOR FLOAT[n]` does not limit it, and gives us the most ML
idiomatic approach and the best long-term CQL API.  It's a win-win
situation – no matter how you look at it imho it is the best solution api
wise.

Unless the suggestion is that an ideal implementation can give us a better
CQL API – but I don't see what that could be.   Maybe the suggestion is we
deny the possibility of using the VECTOR keyword and bring us back to
something like `NON-NULL FROZEN`.   This is odd to me because
`VECTOR` here can be just an alias for `NON-NULL FROZEN` while meeting the
patch's audience and their idioms.  I have no problems with introducing
such an alias to meet the ML crowd.

Another way I think of this is
 `VECTOR FLOAT[n]` is the porcelain ML cql api,
 `NON-NULL FROZEN` and `FROZEN` and `FLOAT[n]` are the
general-use plumbing cql apis.

This would allow implementation details to be moved out of this thread and
to the review phase.




On Mon, 1 May 2023 at 20:57, David Capwell  wrote:

> > I think it is totally reasonable that the ANN patch (and Jonathan) is
> not asked to implement on top of, or towards, other array (or other) new
> data types.
>
>
> This impacts serialization, if you do not think about this day 1 you then
> can’t add later on without having to worry about migration and versioning…
>
> Honestly I wanted to better understand the cost to be generic and the
> impact to ANN, so I took
> https://github.com/jbellis/cassandra/blob/vsearch/src/java/org/apache/cassandra/db/marshal/VectorType.java
> and made it handle every requirement I have listed so far (size, null, all
> types)… the current patch has several bugs at the type level that would
> need to be fixed, so had to fix those as well…. Total time to do this was
> 10 minutes… and this includes adding a method "public float[]
> composeAsFloats(ByteBuffer bytes)” which made the change to existing logic
> small (change VectorType.Serializer.instance.deserialize(buffer) to
> type.composeAsFloats(buffer))….
>
> Did this have any impact to the final ByteBuffer?  Nope, it had identical
> layout for the FloatType case, but works for all types…. I didn’t change
> the fact we store the size (felt this could be removed, but then we could
> never support expanding the vector in the future…)
>
> So, given the fact it takes a few minutes to implement all these
> requirements, I do find it very reasonable to push back and say we should
> make sure the new type is not leaking details from a special ANN index…. We
> have spent more time debating this than it takes to support… we also have
> fuzz testing on trunk so just updating
> org.apache.cassandra.utils.AbstractTypeGenerators to know about this new
> type means we get type coverage as well…
>
> I have zero issues helping to review this patch and make sure the testing
> is on-par with existing types (this is a strong requirement for me)
>
>
> > On May 1, 2023, at 10:40 AM, Mick Semb Wever  wrote:
> >
> >
> > > But suggesting that Jonathan should work on implementing general
> purpose arrays seems to fall outside the scope of this discussion, since
> the result of such work wouldn't even fill the need Jonathan is targeting
> for here.
> >
> > Every comment I have made so far I have argued that the v1 work doesn’t
> need to do some things, but that the limitations proposed so far are not
> real requirements; there is a big difference between what “could be
> allowed” and what is implemented day one… I am pushing back on what “could
> be allowed”, so far every justification has been that it slows down the ANN
> work…
> >
> > Simple examples of this already exist in C* (every example could be
> enhanced logically, we just have yet to put in the work)
> >
> > * updating an element of a list is only allowed for multi-cell
> > * appending to a list is only allowed for multi-cell
> > * etc.
> >
> > By saying that the type "shall not support", you actively block future
> work and future possibilities...
> >
> >
> >
> > I am coming around strongly to the `VECTOR FLOAT[n]` option.
> >
> > This gives Jonathan the simplest path right now with this ANN work, while
> also ensuring the CQL API gets the best future potential.
> >
> > With `VECTOR FLOAT[n]` the 'vector' is the ml sugar that means non-null
> and frozen, and that allows both today and in the future, as desired, for
> its implementation to be entirely different to `FLOAT[n]`.  This addresses
> a number of people's concerns that we meet ML's idioms head on.
> >
> > IMHO it feels like it will fit into the ideal future CQL , where all
> `primitive[N]` are implemented, and where we have VECTOR FLOAT[n] (and
> maybe VECTOR BYTE[n]). This will also permit 

Re: [DISCUSS] New data type for vector search

2023-05-01 Thread David Capwell
> I think it is totally reasonable that the ANN patch (and Jonathan) is not 
> asked to implement on top of, or towards, other array (or other) new data 
> types.


This impacts serialization, if you do not think about this day 1 you then can’t 
add later on without having to worry about migration and versioning… 

Honestly I wanted to better understand the cost to be generic and the impact to 
ANN, so I took 
https://github.com/jbellis/cassandra/blob/vsearch/src/java/org/apache/cassandra/db/marshal/VectorType.java
 and made it handle every requirement I have listed so far (size, null, all 
types)… the current patch has several bugs at the type level that would need to 
be fixed, so had to fix those as well…. Total time to do this was 10 minutes… 
and this includes adding a method "public float[] composeAsFloats(ByteBuffer 
bytes)” which made the change to existing logic small (change 
VectorType.Serializer.instance.deserialize(buffer) to 
type.composeAsFloats(buffer))….

Did this have any impact to the final ByteBuffer?  Nope, it had identical 
layout for the FloatType case, but works for all types…. I didn’t change the 
fact we store the size (felt this could be removed, but then we could never 
support expanding the vector in the future…)

So, given the fact it takes a few minutes to implement all these requirements, 
I do find it very reasonable to push back and say we should make sure the new 
type is not leaking details from a special ANN index…. We have spent more time 
debating this than it takes to support… we also have fuzz testing on trunk so 
just updating org.apache.cassandra.utils.AbstractTypeGenerators to know about 
this new type means we get type coverage as well…

I have zero issues helping to review this patch and make sure the testing is 
on-par with existing types (this is a strong requirement for me)


> On May 1, 2023, at 10:40 AM, Mick Semb Wever  wrote:
> 
> 
> > But suggesting that Jonathan should work on implementing general purpose 
> > arrays seems to fall outside the scope of this discussion, since the result 
> > of such work wouldn't even fill the need Jonathan is targeting for here. 
> 
> Every comment I have made so far I have argued that the v1 work doesn’t need 
> to do some things, but that the limitations proposed so far are not real 
> requirements; there is a big difference between what “could be allowed” and 
> what is implemented day one… I am pushing back on what “could be allowed”, so 
> far every justification has been that it slows down the ANN work…
> 
> Simple examples of this already exist in C* (every example could be enhanced 
> logically, we just have yet to put in the work)
> 
> * updating an element of a list is only allowed for multi-cell
> * appending to a list is only allowed for multi-cell
> * etc.
> 
> By saying that the type "shall not support", you actively block future work 
> and future possibilities...
> 
> 
> 
> I am coming around strongly to the `VECTOR FLOAT[n]` option.
> 
> This gives Jonathan the simplest path right now with this ANN work, while also 
> ensuring the CQL API gets the best future potential.
> 
> With `VECTOR FLOAT[n]` the 'vector' is the ml sugar that means non-null and 
> frozen, and that allows both today and in the future, as desired, for its 
> implementation to be entirely different to `FLOAT[n]`.  This addresses a 
> number of people's concerns that we meet ML's idioms head on.
> 
> IMHO it feels like it will fit into the ideal future CQL , where all 
> `primitive[N]` are implemented, and where we have VECTOR FLOAT[n] (and maybe 
> VECTOR BYTE[n]). This will also permit in the future `FROZEN` 
> if we wanted nulls in frozen arrays.
> 
> I think it is totally reasonable that the ANN patch (and Jonathan) is not 
> asked to implement on top of, or towards, other array (or other) new data 
> types.
> 
> I also think it is correct that we think about the evolution of CQL's API,  
> and how it might exist in the future when we have both ml vectors and general 
> use arrays.



Re: [DISCUSS] New data type for vector search

2023-05-01 Thread Benedict
Has anybody yet claimed it would be hard? Several folk seem ready to jump to 
the conclusion that this would be onerous, but as somebody with a good 
understanding of the storage layer I can assert with reasonable confidence that 
it would not be. As previously stated, the implementation largely already 
exists for frozen lists.

If we are going to let difficulty of implementation inform our CQL evolution, 
my view is that the bar for additional difficulty should be high, as CQL 
changes need to be well considered as they are not easily revisited - bad 
decisions survive indefinitely. The alternative as David points out is a 
plug-in system.

So, maybe let’s wait until somebody makes a specific and serious claim of how 
challenging it would be, with justification, before we jump to compromising our 
language evolution based on it. I’m not even sure yet that this is really a 
consideration by anyone involved.

> On 1 May 2023, at 18:41, Mick Semb Wever  wrote:
> 
> 
>> 
>> > But suggesting that Jonathan should work on implementing general purpose 
>> > arrays seems to fall outside the scope of this discussion, since the 
>> > result of such work wouldn't even fill the need Jonathan is targeting for 
>> > here. 
>> 
>> Every comment I have made so far I have argued that the v1 work doesn’t need 
>> to do some things, but that the limitations proposed so far are not real 
>> requirements; there is a big difference between what “could be allowed” and 
>> what is implemented day one… I am pushing back on what “could be allowed”, 
>> so far every justification has been that it slows down the ANN work…
>> 
>> Simple examples of this already exist in C* (every example could be 
>> enhanced logically, we just have yet to put in the work)
>> 
>> * updating an element of a list is only allowed for multi-cell
>> * appending to a list is only allowed for multi-cell
>> * etc.
>> 
>> By saying that the type "shall not support", you actively block future work 
>> and future possibilities...
> 
> 
> 
> I am coming around strongly to the `VECTOR FLOAT[n]` option.
> 
> This gives Jonathan the simplest path right now with this ANN work, while also 
> ensuring the CQL API gets the best future potential.
> 
> With `VECTOR FLOAT[n]` the 'vector' is the ml sugar that means non-null and 
> frozen, and that allows both today and in the future, as desired, for its 
> implementation to be entirely different to `FLOAT[n]`.  This addresses a 
> number of people's concerns that we meet ML's idioms head on.
> 
> IMHO it feels like it will fit into the ideal future CQL , where all 
> `primitive[N]` are implemented, and where we have VECTOR FLOAT[n] (and maybe 
> VECTOR BYTE[n]). This will also permit in the future `FROZEN` 
> if we wanted nulls in frozen arrays.
> 
> I think it is totally reasonable that the ANN patch (and Jonathan) is not 
> asked to implement on top of, or towards, other array (or other) new data 
> types.
> 
> I also think it is correct that we think about the evolution of CQL's API,  
> and how it might exist in the future when we have both ml vectors and general 
> use arrays.


Re: [DISCUSS] New data type for vector search

2023-05-01 Thread Mick Semb Wever
>
>
> > But suggesting that Jonathan should work on implementing general purpose
> arrays seems to fall outside the scope of this discussion, since the result
> of such work wouldn't even fill the need Jonathan is targeting for here.
>
> Every comment I have made so far I have argued that the v1 work doesn’t
> need to do some things, but that the limitations proposed so far are not
> real requirements; there is a big difference between what “could be
> allowed” and what is implemented day one… I am pushing back on what “could
> be allowed”, so far every justification has been that it slows down the ANN
> work…
>
> Simple examples of this already exist in C* (every example could be
> enhanced logically, we just have yet to put in the work)
>
> * updating an element of a list is only allowed for multi-cell
> * appending to a list is only allowed for multi-cell
> * etc.
>
> By saying that the type "shall not support", you actively block future
> work and future possibilities...
>



I am coming around strongly to the `VECTOR FLOAT[n]` option.

This gives Jonathan the simplest path right now with this ANN work, while
also ensuring the CQL API gets the best future potential.

With `VECTOR FLOAT[n]` the 'vector' is the ml sugar that means non-null and
frozen, and that allows both today and in the future, as desired, for its
implementation to be entirely different to `FLOAT[n]`.  This addresses a
number of people's concerns that we meet ML's idioms head on.

IMHO it feels like it will fit into the ideal future CQL, where all
`primitive[N]` are implemented, and where we have VECTOR FLOAT[n] (and maybe
VECTOR BYTE[n]). This will also permit in the future `FROZEN`
if we wanted nulls in frozen arrays.

I think it is totally reasonable that the ANN patch (and Jonathan) is not
asked to implement on top of, or towards, other array (or other) new data
types.

I also think it is correct that we think about the evolution of CQL's API,
 and how it might exist in the future when we have both ml vectors and
general use arrays.


Re: [DISCUSS] New data type for vector search

2023-05-01 Thread David Capwell
> In particular it makes no sense at all from an ML perspective to have vector 
> types of anything other than numerics

Back to what Benedict was saying, if the proposal was an ML plug-in, then this 
limitation makes sense, but that is not the proposal at hand.  If you wish to 
change the scope to add pluggable types, then this type of plug-in could follow 
whatever rules it desires.

> but we have no reasonable path towards supporting indexing and searches 
> against any other vector type.

The type system is different from the index system; the index system is allowed to 
limit the domain of possible types to a blessed set… so arguing that a new type 
should be added that has limitations for indexing doesn’t make much sense to 
me, as those are index-specific limitations…

> 4. Add a vector type that composes with all Cassandra types.  I can't see a 
> reason to do this, nobody wants it, and we killed the most similar proposal 
> in the past as wontfix.

We don’t have a constraint system at the moment, so such constraints are 
normally implemented in the application… so I would not argue that nobody would 
want “VECTOR”…

> Benedict, I don't quite see why that matters? 

The global public API is different than a plugin system… Right now we allow 
pluggable SSTables, and Indexes… both could add any limitation they desire as 
they are plugins… but when we are working with the top level systems we do need 
to worry about compatibility…

> The argument is merely that this kind of vector, for this use case, a) is 
> different from arrays, and b) arrays apparently don't serve the use case well 
> enough (or at all).

I have listed every requirement so far, and they are all constraints on arrays… 
I am not arguing that this new type should “extend” from array type in java, or 
that CQL has any ability to convert between the two types… but every single 
requirement given so far is a constraint against arrays…

> But suggesting that Jonathan should work on implementing general purpose 
> arrays seems to fall outside the scope of this discussion, since the result 
> of such work wouldn't even fill the need Jonathan is targeting for here. 

Every comment I have made so far I have argued that the v1 work doesn’t need to 
do some things, but that the limitations proposed so far are not real 
requirements; there is a big difference between what “could be allowed” and 
what is implemented day one… I am pushing back on what “could be allowed”, so 
far every justification has been that it slows down the ANN work…

Simple examples of this already exist in C* (every example could be enhanced 
logically, we just have yet to put in the work)

* updating an element of a list is only allowed for multi-cell
* appending to a list is only allowed for multi-cell
* etc.

By saying that the type "shall not support", you actively block future work and 
future possibilities...

> At most these say “ANN indexes should only support float types” which is 
> different, and not something I would dispute.


Agree here, the type limitation is a limitation for ANN, so should live in ANN 
and not leak outside there.

> By my superficial reading I get the impression that the main distinction is 
> that vectors don't need to support random access into a single element/float

They don’t “need” it for this single use case, that doesn’t mean that no user 
will ever wish to write “SELECT my_vector[42]”.  Do we “need” to add such 
support day 1?  No.  Does Jonathan have to implement this once the ANN work is 
merged?  No.  I just want to be clear I am pushing back that the type should 
never allow, as every justification has been specific to a specific use case, 
and that the broader type having such capabilities does not actually impact the 
ANN work.
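
As a rough illustration of why element access need not conflict with the ANN use case: if elements are fixed-size, reading one element is a bounds check plus an offset computation over the same bytes. This is only a sketch; the class name `VectorElementAccess`, the method `floatAt`, and the assumed layout (consecutive 4-byte floats starting at the buffer's position) are hypothetical and not part of any patch discussed here.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of element access such as my_vector[42] over a
// fixed-length float vector stored as consecutive 4-byte values.
public final class VectorElementAccess {
    private VectorElementAccess() {}

    public static float floatAt(ByteBuffer vector, int dimension, int index) {
        if (index < 0 || index >= dimension) {
            throw new IndexOutOfBoundsException(
                "index " + index + " outside [0, " + dimension + ")");
        }
        // Absolute read: does not move the buffer's position.
        return vector.getFloat(vector.position() + index * Float.BYTES);
    }
}
```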

> I haven't looked at what Jonathan is doing, but I assume, and it seems 
> Jonathan assumes or knows that this makes implementation both easier and 
> allows for important optimizations

His patch has a function that goes from BB -> float[].  The fact that a vector could 
support non-float types does not actually impact that work, as you would still 
do BB -> float[]; you just need the index to validate that the type is 
vector at the index create step (which you have to do already 
regardless)…
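
A rough sketch of that split, assuming a general vector type plus a float-only check at index creation. VectorType, validate_ann_index and to_float_array are hypothetical names used only for illustration, and the big-endian 32-bit layout is an assumption; none of this is the actual patch.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class VectorType:
    """Hypothetical description of a column type: any element type, fixed size."""
    element_type: str  # e.g. "float", "int", "text"
    dimension: int


def validate_ann_index(column_type):
    """Index-creation check: the type stays general, the index is float-only."""
    if column_type.element_type != "float":
        raise ValueError(
            f"ANN index requires a vector of float, got {column_type.element_type}"
        )


def to_float_array(buf, column_type):
    """Stand-in for the BB -> float[] step; only reached for float vectors."""
    expected = column_type.dimension * 4
    if len(buf) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(buf)}")
    return list(struct.unpack(f">{column_type.dimension}f", buf))


floats = VectorType("float", 3)
validate_ann_index(floats)  # passes; a text vector would be rejected here
decoded = to_float_array(struct.pack(">3f", 1.0, 2.0, 3.0), floats)
```

The decode step is unchanged whether or not other element types exist; the only extra cost is the check at index creation.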

The same is true for multi-cell support, you “could” have a function that takes 
a List -> float[]… if people really feel that we shouldn’t allow 
multi-cell, that’s fine by me…  The biggest limitation for this I see is that 
SAI works at the Cell level, but talking with Caleb that is a short term 
limitation and something desired to improve (treating frozen vs non-frozen 
differently isn’t desired).  



To repeat the list of requirements I summarized so far, this is everything I 
have seen

1) represents a fixed size array of element type
2) element may not be null
3) works for all types
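
As a sketch only, the three requirements could be enforced on the write path roughly like this. The names validate_vector and require_type are hypothetical, and the per-element check stands in for whatever validation the element type already has.

```python
def require_type(expected):
    """Build a per-element validator for one element type (illustrative only)."""
    def check(value):
        if not isinstance(value, expected):
            raise TypeError(f"expected {expected.__name__}, got {type(value).__name__}")
    return check


def validate_vector(values, dimension, element_validator):
    """Write-path validation for the three requirements listed above."""
    # 1) fixed size
    if len(values) != dimension:
        raise ValueError(f"expected {dimension} elements, got {len(values)}")
    for position, value in enumerate(values):
        # 2) elements may not be null
        if value is None:
            raise ValueError(f"element {position} is null")
        # 3) works for all types: delegate to the element type's own validation
        element_validator(value)
    return values


ok_floats = validate_vector([1.0, 2.0], 2, require_type(float))
ok_text = validate_vector(["a", "b", "c"], 3, require_type(str))
```

Nothing in the size or null checks depends on the element type, which is the point of requirement 3.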

I removed the frozen one, but as far as I can tell from the ANN work this isn’t 
a requirement for ANN, it just needs a way to 

Re: [DISCUSS] New data type for vector search

2023-04-28 Thread Henrik Ingo
By my superficial reading I get the impression that the main distinction is
that vectors don't need to support random access into a single
element/float. I haven't looked at what Jonathan is doing, but I assume,
and it seems Jonathan assumes or knows that this makes implementation both
easier and allows for important optimizations. Am I following correctly
here?

(Apologies if that is what your #1 is saying, I read yours as something
about secondary or maybe clustered indexes?)

Agree with #3 obviously.

#2... Vectors actually *could* support ordered (n-dimensional) indexes,
since they are vectors. But in practice it seems even asking for a simple
3D index is too much and too niche for anything else than Postgis.

henrik

On Fri, Apr 28, 2023 at 8:50 PM Benedict  wrote:

> I and others have claimed that an array concept will work, since it is
> isomorphic with a vector. I have seen the following counterclaims:
>
> 1. Vectors don’t need to support index lookups
> 2. Vectors don’t need to support ordered indexes
> 3. Vectors don’t need to support other types besides float
>
> None of these say that *vectors are not arrays*. At most these say “ANN
> indexes should only support float types” which is different, and not
> something I would dispute.
>
> If the claim is "there is no concept of arrays that is compatible with
> vector search" then let’s focus on that, because that is probably the
> initial source of the disconnect.
>
>
>
>
> On 28 Apr 2023, at 18:13, Henrik Ingo  wrote:
>
> 
> Benedict, I don't quite see why that matters? The argument is merely that
> this kind of vector, for this use case, a) is different from arrays, and b)
> arrays apparently don't serve the use case well enough (or at all).
>
> Now, if from the above it follows a discussion that a vector type cannot
> be a first class Cassandra type... that is of course a possible argument.
>
> But suggesting that Jonathan should work on implementing general purpose
> arrays seems to fall outside the scope of this discussion, since the result
> of such work wouldn't even fill the need Jonathan is targeting for here. I
> could also ask Jonathan to work on a JSONB data type, and it similarly
> would not be an interesting proposal to Jonathan, as it wouldn't fill the
> need for the specific use case he is targeting.
>
>
> But back to the main question... Why wouldn't a "vector for floats" type
> be general purpose enough that it should be delegated to some plugin?
> Machine Learning is a broad field in itself, with dozens of algorithms you
> could choose to use to build an AI model. And AI can be used in pretty much
> every industry vertical. If anything, I would claim DECIMAL is much more an
> industry specific special case type than these ML vectors would be.
>
>
>
> Back to Jonathan:
> >So in order of what makes sense to me:
> > 1. Add a vector type for just floats; consider adding bytes later if
> demand materializes. This gives us 99% of the value and limits the scope so
> we can deliver quickly.
> > 2. Add a vector type for floats or bytes. This gives us another 1% of
> value in exchange for an extra 20% or so of effort.
>
> Is it possible to implement 1 in a way that makes 2 possible in a future
> version?
>
> henrik
>
>
> henrik
>
> On Fri, Apr 28, 2023 at 7:33 PM Benedict  wrote:
>
>> pgvector is a plug-in. If you were proposing a plug-in you could ignore
>> these considerations.
>>
>> On 28 Apr 2023, at 16:58, Jonathan Ellis  wrote:
>>
>> 
>> I'm proposing a vector data type for ML use cases.  It's not the same
>> thing as an array or a list and it's not supposed to be.
>>
>> While it's true that it would be possible to build a vector type on top
>> of an array type, it's not necessary to do it that way, and given the lack
>> of interest in an array type for its own sake I don't see why we would want
>> to make that a requirement.
>>
>> It's relevant that pgvector, which among the systems offering vector
>> search is based on the most similar system to Cassandra in terms of its
>> query language, adds a vector data type that only supports floats *even
>> though postgresql already has an array data type* because the semantics are
>> different.  Random access doesn't make sense, string and collection and
>> other datatypes don't make sense, typical ordered indexes don't make sense,
>> etc.  It's just a different beast from arrays, for a different use case.
>>
>> On Fri, Apr 28, 2023 at 10:40 AM Benedict  wrote:
>>
>>> But you’re proposing introducing a general purpose type - this isn’t an
>>> ML plug-in, it’s modifying the core language in a manner that makes
>>> targeting your workload easier. Which is fine, but that means you have to
>>> consider its impact on the general language, not just your target use case.
>>>
>>> On 28 Apr 2023, at 16:29, Jonathan Ellis  wrote:
>>>
>>> 
>>> That's exactly right.
>>>
>>> In particular it makes no sense at all from an ML perspective to have
>>> vector types of anything other than 

Re: [DISCUSS] New data type for vector search

2023-04-28 Thread Patrick McFadin
>
> So is the goal here to provide something specific and idiomatic for the ML
> community or is the goal to make a primitive that's C*-centric that then
> another layer can write to? I personally argue for the former; I don't see
> this specific data type going away any time soon.


+1 on this concept. We could invite an entirely new class of users into
Cassandra by using familiar syntax. I was surprised that DENSE got nuked so
quickly since it is used in the ML world. [1][2][3]

Patrick

1.
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.linalg.DenseVector.html
2. https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense
3. https://www.pinecone.io/learn/dense-vector-embeddings-nlp/

On Thu, Apr 27, 2023 at 5:49 PM Josh McKenzie  wrote:

> From a machine learning perspective, vectors are a well-known concept that
> are effectively immutable fixed-length n-dimensional values that are then
> later used either as part of a model or in conjunction with a model after
> the fact.
>
> While we could have this be non-frozen and not call it a vector, I'd be
> inclined to still make the argument for a layer of syntactic sugar on top
> that met ML users where they were with concepts they understood rather than
> forcing them through the cognitive lift of figuring out the Cassandra
> specific contortions to replicate something that's ubiquitous in their
> space. We did the same "Cassandra-first" approach with our JSON support and
> that didn't do us any favors in terms of adoption and usage as far as I
> know.
>
> So is the goal here to provide something specific and idiomatic for the ML
> community or is the goal to make a primitive that's C*-centric that then
> another layer can write to? I personally argue for the former; I don't see
> this specific data type going away any time soon.
>
> On Thu, Apr 27, 2023, at 12:39 PM, David Capwell wrote:
>
> but as you point out it has the problem of allowing nulls.
>
>
> If nulls are not allowed for the elements, then either we need  a) a new
> type, or b) add some way to say elements may not be null…. As much as I do
> like b, I am leaning towards new type for this use case.
>
> So, to flesh out the type requirements I have seen so far
>
> 1) represents a fixed size array of element type
> * on write path we will need to validate this
> 2) element may not be null
> * on write path we will need to validate this
> 3) “frozen” (is this really a requirement for the type or is this
> just simpler for the ANN work?  I feel that this shouldn’t be a requirement)
> 4) works for all types (my requirement; original proposal is float only,
> but could logically expand to primitive types)
>
> Anything else?
>
> The key thing about a vector is that unlike lists or tuples you really
> don't care about individual elements, you care about doing vector and
> matrix multiplications with the thing as a unit.
>
>
> That maybe true for this use case, but “should” this be true for the type
> itself?  I feel like no… if a user wants the Nth element of a vector why
> would we block them?  I am not saying the first patch, or even 5.0 adds
> support for index access, I am just trying to push back saying that the
> type should not block this.
>
> (Maybe this is making the case for VECTOR FLOAT[N] rather than FLOAT
> VECTOR[N].)
>
>
> Now that nulls are not allowed, I have mixed feelings about FLOAT[N], I
> prefer this syntax but that limitation may not be desired for all use
> cases… we could always add LIST and ARRAY later
> to address that case.
>
> In terms of syntax I have seen, here is my ordered preference:
>
> 1) TYPE[size] - have mixed feelings due to non-null, but still prefer it
> 2) QUALIFIER TYPE[size] - QUALIFIER is just a Term we use to denote this
> semantic…. Could even be NON NULL TYPE[size]
>
> On Apr 27, 2023, at 9:00 AM, Benedict  wrote:
>
>
> That’s a bounded ring buffer, not a fixed length array.
>
> This definitely isn’t a tuple because the types are all the same, which is
> pretty crucial for matrix operations. Matrix libraries generally work on
> arrays of known dimensionality, or sparse representations.
>
> Whether we draw any semantic link between the frozen list and whatever we
> do here, it is fundamentally a frozen list with a restriction on its size.
> What we’re defining here are “statically” sized arrays, whereas a frozen
> list is essentially a dynamically sized array.
>
> I do not think vector is a good name because vector is used in some other
> popular languages to mean a (dynamic) list, which is confusing when we also
> have a list concept.
>
> I’m fine with just using the FLOAT[N] syntax, and drawing no direct link
> with list. Though it is a bit strange that this particular type declaration
> looks so different to other collection types.
>
> On 27 Apr 2023, at 16:48, Jeff Jirsa  wrote:
>
> 
>
>
> On Thu, Apr 27, 2023 at 7:39 AM Jonathan Ellis  wrote:
>
> It's been a while, so I may be missing something, but do we already have
> 

Re: [DISCUSS] New data type for vector search

2023-04-28 Thread Benedict
I and others have claimed that an array concept will work, since it is isomorphic with a vector. I have seen the following counterclaims:

1. Vectors don’t need to support index lookups
2. Vectors don’t need to support ordered indexes
3. Vectors don’t need to support other types besides float

None of these say that vectors are not arrays. At most these say “ANN indexes should only support float types” which is different, and not something I would dispute.

If the claim is "there is no concept of arrays that is compatible with vector search" then let’s focus on that, because that is probably the initial source of the disconnect.

On 28 Apr 2023, at 18:13, Henrik Ingo  wrote:

Benedict, I don't quite see why that matters? The argument is merely that this kind of vector, for this use case, a) is different from arrays, and b) arrays apparently don't serve the use case well enough (or at all).

Now, if from the above it follows a discussion that a vector type cannot be a first class Cassandra type... that is of course a possible argument.

But suggesting that Jonathan should work on implementing general purpose arrays seems to fall outside the scope of this discussion, since the result of such work wouldn't even fill the need Jonathan is targeting for here. I could also ask Jonathan to work on a JSONB data type, and it similarly would not be an interesting proposal to Jonathan, as it wouldn't fill the need for the specific use case he is targeting.

But back to the main question... Why wouldn't a "vector for floats" type be general purpose enough that it should be delegated to some plugin? Machine Learning is a broad field in itself, with dozens of algorithms you could choose to use to build an AI model. And AI can be used in pretty much every industry vertical. If anything, I would claim DECIMAL is much more an industry specific special case type than these ML vectors would be.

Back to Jonathan:
>So in order of what makes sense to me:
> 1. Add a vector type for just floats; consider adding bytes later if demand materializes. This gives us 99% of the value and limits the scope so we can deliver quickly.
> 2. Add a vector type for floats or bytes. This gives us another 1% of value in exchange for an extra 20% or so of effort.

Is it possible to implement 1 in a way that makes 2 possible in a future version?

henrik

On Fri, Apr 28, 2023 at 7:33 PM Benedict  wrote:

pgvector is a plug-in. If you were proposing a plug-in you could ignore these considerations.

On 28 Apr 2023, at 16:58, Jonathan Ellis  wrote:

I'm proposing a vector data type for ML use cases.  It's not the same thing as an array or a list and it's not supposed to be.

While it's true that it would be possible to build a vector type on top of an array type, it's not necessary to do it that way, and given the lack of interest in an array type for its own sake I don't see why we would want to make that a requirement.

It's relevant that pgvector, which among the systems offering vector search is based on the most similar system to Cassandra in terms of its query language, adds a vector data type that only supports floats *even though postgresql already has an array data type* because the semantics are different.  Random access doesn't make sense, string and collection and other datatypes don't make sense, typical ordered indexes don't make sense, etc.  It's just a different beast from arrays, for a different use case.

On Fri, Apr 28, 2023 at 10:40 AM Benedict  wrote:

But you’re proposing introducing a general purpose type - this isn’t an ML plug-in, it’s modifying the core language in a manner that makes targeting your workload easier. Which is fine, but that means you have to consider its impact on the general language, not just your target use case.

On 28 Apr 2023, at 16:29, Jonathan Ellis  wrote:

That's exactly right.

In particular it makes no sense at all from an ML perspective to have vector types of anything other than numerics.  And as I mentioned in the POC thread (but I did not mention here), float is overwhelmingly the most frequently used vector type, to the point that Pinecone (by far the most popular vector search engine) ONLY supports that type.

Lucene and Elastic also add support for vectors of bytes (8-bit ints), which are useful for optimizing models that you have already built with floats, but we have no reasonable path towards supporting indexing and searches against any other vector type.

So in order of what makes sense to me:

1. Add a vector type for just floats; consider adding bytes later if demand materializes. This gives us 99% of the value and limits the scope so we can deliver quickly.

2. Add a vector type for floats or bytes. This gives us another 1% of value in exchange for an extra 20% or so of effort.

3. Add a vector type for all numeric primitives, but you can only index floats and bytes.  I think this is confusing to users and a bad idea.

4. Add a vector type that composes 

Re: [DISCUSS] New data type for vector search

2023-04-28 Thread Henrik Ingo
Benedict, I don't quite see why that matters? The argument is merely that
this kind of vector, for this use case, a) is different from arrays, and b)
arrays apparently don't serve the use case well enough (or at all).

Now, if from the above it follows a discussion that a vector type cannot be
a first class Cassandra type... that is of course a possible argument.

But suggesting that Jonathan should work on implementing general purpose
arrays seems to fall outside the scope of this discussion, since the result
of such work wouldn't even fill the need Jonathan is targeting for here. I
could also ask Jonathan to work on a JSONB data type, and it similarly
would not be an interesting proposal to Jonathan, as it wouldn't fill the
need for the specific use case he is targeting.


But back to the main question... Why wouldn't a "vector for floats" type be
general purpose enough that it should be delegated to some plugin? Machine
Learning is a broad field in itself, with dozens of algorithms you could
choose to use to build an AI model. And AI can be used in pretty much every
industry vertical. If anything, I would claim DECIMAL is much more an
industry specific special case type than these ML vectors would be.



Back to Jonathan:
>So in order of what makes sense to me:
> 1. Add a vector type for just floats; consider adding bytes later if
demand materializes. This gives us 99% of the value and limits the scope so
we can deliver quickly.
> 2. Add a vector type for floats or bytes. This gives us another 1% of
value in exchange for an extra 20% or so of effort.

Is it possible to implement 1 in a way that makes 2 possible in a future
version?

henrik

On Fri, Apr 28, 2023 at 7:33 PM Benedict  wrote:

> pgvector is a plug-in. If you were proposing a plug-in you could ignore
> these considerations.
>
> On 28 Apr 2023, at 16:58, Jonathan Ellis  wrote:
>
> 
> I'm proposing a vector data type for ML use cases.  It's not the same
> thing as an array or a list and it's not supposed to be.
>
> While it's true that it would be possible to build a vector type on top of
> an array type, it's not necessary to do it that way, and given the lack of
> interest in an array type for its own sake I don't see why we would want to
> make that a requirement.
>
> It's relevant that pgvector, which among the systems offering vector
> search is based on the most similar system to Cassandra in terms of its
> query language, adds a vector data type that only supports floats *even
> though postgresql already has an array data type* because the semantics are
> different.  Random access doesn't make sense, string and collection and
> other datatypes don't make sense, typical ordered indexes don't make sense,
> etc.  It's just a different beast from arrays, for a different use case.
>
> On Fri, Apr 28, 2023 at 10:40 AM Benedict  wrote:
>
>> But you’re proposing introducing a general purpose type - this isn’t an
>> ML plug-in, it’s modifying the core language in a manner that makes
>> targeting your workload easier. Which is fine, but that means you have to
>> consider its impact on the general language, not just your target use case.
>>
>> On 28 Apr 2023, at 16:29, Jonathan Ellis  wrote:
>>
>> 
>> That's exactly right.
>>
>> In particular it makes no sense at all from an ML perspective to have
>> vector types of anything other than numerics.  And as I mentioned in the
>> POC thread (but I did not mention here), float is overwhelmingly the most
>> frequently used vector type, to the point that Pinecone (by far the most
>> popular vector search engine) ONLY supports that type.
>>
>> Lucene and Elastic also add support for vectors of bytes (8-bit ints),
>> which are useful for optimizing models that you have already built with
>> floats, but we have no reasonable path towards supporting indexing and
>> searches against any other vector type.
>>
>> So in order of what makes sense to me:
>>
>> 1. Add a vector type for just floats; consider adding bytes later if
>> demand materializes. This gives us 99% of the value and limits the scope so
>> we can deliver quickly.
>>
>> 2. Add a vector type for floats or bytes. This gives us another 1% of
>> value in exchange for an extra 20% or so of effort.
>>
>> 3. Add a vector type for all numeric primitives, but you can only index
>> floats and bytes.  I think this is confusing to users and a bad idea.
>>
>> 4. Add a vector type that composes with all Cassandra types.  I can't see
>> a reason to do this, nobody wants it, and we killed the most similar
>> proposal in the past as wontfix.
>>
>> On Thu, Apr 27, 2023 at 7:49 PM Josh McKenzie 
>> wrote:
>>
>>> From a machine learning perspective, vectors are a well-known concept
>>> that are effectively immutable fixed-length n-dimensional values that are
>>> then later used either as part of a model or in conjunction with a model
>>> after the fact.
>>>
>>> While we could have this be non-frozen and not call it a vector, I'd be
>>> inclined to still 

Re: [DISCUSS] New data type for vector search

2023-04-28 Thread Benedict
pgvector is a plug-in. If you were proposing a plug-in you could ignore these considerations.

On 28 Apr 2023, at 16:58, Jonathan Ellis  wrote:

I'm proposing a vector data type for ML use cases.  It's not the same thing as an array or a list and it's not supposed to be.

While it's true that it would be possible to build a vector type on top of an array type, it's not necessary to do it that way, and given the lack of interest in an array type for its own sake I don't see why we would want to make that a requirement.

It's relevant that pgvector, which among the systems offering vector search is based on the most similar system to Cassandra in terms of its query language, adds a vector data type that only supports floats *even though postgresql already has an array data type* because the semantics are different.  Random access doesn't make sense, string and collection and other datatypes don't make sense, typical ordered indexes don't make sense, etc.  It's just a different beast from arrays, for a different use case.

On Fri, Apr 28, 2023 at 10:40 AM Benedict  wrote:

But you’re proposing introducing a general purpose type - this isn’t an ML plug-in, it’s modifying the core language in a manner that makes targeting your workload easier. Which is fine, but that means you have to consider its impact on the general language, not just your target use case.

On 28 Apr 2023, at 16:29, Jonathan Ellis  wrote:

That's exactly right.

In particular it makes no sense at all from an ML perspective to have vector types of anything other than numerics.  And as I mentioned in the POC thread (but I did not mention here), float is overwhelmingly the most frequently used vector type, to the point that Pinecone (by far the most popular vector search engine) ONLY supports that type.

Lucene and Elastic also add support for vectors of bytes (8-bit ints), which are useful for optimizing models that you have already built with floats, but we have no reasonable path towards supporting indexing and searches against any other vector type.

So in order of what makes sense to me:

1. Add a vector type for just floats; consider adding bytes later if demand materializes. This gives us 99% of the value and limits the scope so we can deliver quickly.

2. Add a vector type for floats or bytes. This gives us another 1% of value in exchange for an extra 20% or so of effort.

3. Add a vector type for all numeric primitives, but you can only index floats and bytes.  I think this is confusing to users and a bad idea.

4. Add a vector type that composes with all Cassandra types.  I can't see a reason to do this, nobody wants it, and we killed the most similar proposal in the past as wontfix.

On Thu, Apr 27, 2023 at 7:49 PM Josh McKenzie  wrote:

From a machine learning perspective, vectors are a well-known concept that are effectively immutable fixed-length n-dimensional values that are then later used either as part of a model or in conjunction with a model after the fact.

While we could have this be non-frozen and not call it a vector, I'd be inclined to still make the argument for a layer of syntactic sugar on top that met ML users where they were with concepts they understood rather than forcing them through the cognitive lift of figuring out the Cassandra specific contortions to replicate something that's ubiquitous in their space. We did the same "Cassandra-first" approach with our JSON support and that didn't do us any favors in terms of adoption and usage as far as I know.

So is the goal here to provide something specific and idiomatic for the ML community or is the goal to make a primitive that's C*-centric that then another layer can write to? I personally argue for the former; I don't see this specific data type going away any time soon.

On Thu, Apr 27, 2023, at 12:39 PM, David Capwell wrote:

but as you point out it has the problem of allowing nulls.

If nulls are not allowed for the elements, then either we need  a) a new type, or b) add some way to say elements may not be null…. As much as I do like b, I am leaning towards new type for this use case.

So, to flesh out the type requirements I have seen so far

1) represents a fixed size array of element type
* on write path we will need to validate this
2) element may not be null
* on write path we will need to validate this
3) “frozen” (is this really a requirement for the type or is this just simpler for the ANN work?  I feel that this shouldn’t be a requirement)
4) works for all types (my requirement; original proposal is float only, but could logically expand to primitive types)

Anything else?

The key thing about a vector is that unlike lists or tuples you really don't care about individual elements, you care about doing vector and matrix multiplications with the thing as a unit.

That maybe true for this use case, but “should” this be true for the type itself?  I feel like no… if a user wants the Nth element of a vector why would we block them?  I am not saying 

Re: [DISCUSS] New data type for vector search

2023-04-28 Thread Jonathan Ellis
I'm proposing a vector data type for ML use cases.  It's not the same thing
as an array or a list and it's not supposed to be.

While it's true that it would be possible to build a vector type on top of
an array type, it's not necessary to do it that way, and given the lack of
interest in an array type for its own sake I don't see why we would want to
make that a requirement.

It's relevant that pgvector, which among the systems offering vector search
is based on the most similar system to Cassandra in terms of its query
language, adds a vector data type that only supports floats *even though
postgresql already has an array data type* because the semantics are
different.  Random access doesn't make sense, string and collection and
other datatypes don't make sense, typical ordered indexes don't make sense,
etc.  It's just a different beast from arrays, for a different use case.
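
For readers less familiar with the ML side, a minimal sketch of the access pattern being described: each vector is compared as a whole and individual elements are never read on their own. This is an exact brute-force scan using cosine similarity, not an ANN index, and the names cosine_similarity and nearest are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Compare two vectors as whole units."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def nearest(query, rows, k):
    """Exact top-k by similarity; an ANN index approximates this without a full scan."""
    ranked = sorted(
        rows.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [key for key, _ in ranked[:k]]


rows = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.9, 0.1]}
top_two = nearest([1.0, 0.0], rows, 2)
```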

On Fri, Apr 28, 2023 at 10:40 AM Benedict  wrote:

> But you’re proposing introducing a general purpose type - this isn’t an ML
> plug-in, it’s modifying the core language in a manner that makes targeting
> your workload easier. Which is fine, but that means you have to consider
> its impact on the general language, not just your target use case.
>
> On 28 Apr 2023, at 16:29, Jonathan Ellis  wrote:
>
> 
> That's exactly right.
>
> In particular it makes no sense at all from an ML perspective to have
> vector types of anything other than numerics.  And as I mentioned in the
> POC thread (but I did not mention here), float is overwhelmingly the most
> frequently used vector type, to the point that Pinecone (by far the most
> popular vector search engine) ONLY supports that type.
>
> Lucene and Elastic also add support for vectors of bytes (8-bit ints),
> which are useful for optimizing models that you have already built with
> floats, but we have no reasonable path towards supporting indexing and
> searches against any other vector type.
>
> So in order of what makes sense to me:
>
> 1. Add a vector type for just floats; consider adding bytes later if
> demand materializes. This gives us 99% of the value and limits the scope so
> we can deliver quickly.
>
> 2. Add a vector type for floats or bytes. This gives us another 1% of
> value in exchange for an extra 20% or so of effort.
>
> 3. Add a vector type for all numeric primitives, but you can only index
> floats and bytes.  I think this is confusing to users and a bad idea.
>
> 4. Add a vector type that composes with all Cassandra types.  I can't see
> a reason to do this, nobody wants it, and we killed the most similar
> proposal in the past as wontfix.
>
> On Thu, Apr 27, 2023 at 7:49 PM Josh McKenzie 
> wrote:
>
>> From a machine learning perspective, vectors are a well-known concept
>> that are effectively immutable fixed-length n-dimensional values that are
>> then later used either as part of a model or in conjunction with a model
>> after the fact.
>>
>> While we could have this be non-frozen and not call it a vector, I'd be
>> inclined to still make the argument for a layer of syntactic sugar on top
>> that met ML users where they were with concepts they understood rather than
>> forcing them through the cognitive lift of figuring out the Cassandra
>> specific contortions to replicate something that's ubiquitous in their
>> space. We did the same "Cassandra-first" approach with our JSON support and
>> that didn't do us any favors in terms of adoption and usage as far as I
>> know.
>>
>> So is the goal here to provide something specific and idiomatic for the
>> ML community or is the goal to make a primitive that's C*-centric that then
>> another layer can write to? I personally argue for the former; I don't see
>> this specific data type going away any time soon.
>>
>> On Thu, Apr 27, 2023, at 12:39 PM, David Capwell wrote:
>>
>> but as you point out it has the problem of allowing nulls.
>>
>>
>> If nulls are not allowed for the elements, then either we need  a) a new
>> type, or b) add some way to say elements may not be null…. As much as I do
>> like b, I am leaning towards new type for this use case.
>>
>> So, to flesh out the type requirements I have seen so far
>>
>> 1) represents a fixed size array of element type
>> * on write path we will need to validate this
>> 2) element may not be null
>> * on write path we will need to validate this
>> 3) “frozen” (is this really a requirement for the type or is this
>> just simpler for the ANN work?  I feel that this shouldn’t be a requirement)
>> 4) works for all types (my requirement; original proposal is float only,
>> but could logically expand to primitive types)
>>
>> Anything else?
>>
>> The key thing about a vector is that unlike lists or tuples you really
>> don't care about individual elements, you care about doing vector and
>> matrix multiplications with the thing as a unit.
>>
>>
>> That maybe true for this use case, but “should” this be true for the type
>> itself?  I feel like no… if a user wants the Nth element 

Re: [DISCUSS] New data type for vector search

2023-04-28 Thread Benedict
But you’re proposing introducing a general purpose type - this isn’t an ML plug-in, it’s modifying the core language in a manner that makes targeting your workload easier. Which is fine, but that means you have to consider its impact on the general language, not just your target use case.

On 28 Apr 2023, at 16:29, Jonathan Ellis wrote:

That's exactly right.

In particular it makes no sense at all from an ML perspective to have vector types of anything other than numerics.  And as I mentioned in the POC thread (but I did not mention here), float is overwhelmingly the most frequently used vector type, to the point that Pinecone (by far the most popular vector search engine) ONLY supports that type.

Lucene and Elastic also add support for vectors of bytes (8-bit ints), which are useful for optimizing models that you have already built with floats, but we have no reasonable path towards supporting indexing and searches against any other vector type.

So in order of what makes sense to me:

1. Add a vector type for just floats; consider adding bytes later if demand materializes. This gives us 99% of the value and limits the scope so we can deliver quickly.

2. Add a vector type for floats or bytes. This gives us another 1% of value in exchange for an extra 20% or so of effort.

3. Add a vector type for all numeric primitives, but you can only index floats and bytes.  I think this is confusing to users and a bad idea.

4. Add a vector type that composes with all Cassandra types.  I can't see a reason to do this, nobody wants it, and we killed the most similar proposal in the past as wontfix.

On Thu, Apr 27, 2023 at 7:49 PM Josh McKenzie wrote:

From a machine learning perspective, vectors are a well-known concept that are effectively immutable fixed-length n-dimensional values that are then later used either as part of a model or in conjunction with a model after the fact.

While we could have this be non-frozen and not call it a vector, I'd be inclined to still make the argument for a layer of syntactic sugar on top that met ML users where they were with concepts they understood rather than forcing them through the cognitive lift of figuring out the Cassandra specific contortions to replicate something that's ubiquitous in their space. We did the same "Cassandra-first" approach with our JSON support and that didn't do us any favors in terms of adoption and usage as far as I know.

So is the goal here to provide something specific and idiomatic for the ML community or is the goal to make a primitive that's C*-centric that then another layer can write to? I personally argue for the former; I don't see this specific data type going away any time soon.

On Thu, Apr 27, 2023, at 12:39 PM, David Capwell wrote:

but as you point out it has the problem of allowing nulls.

If nulls are not allowed for the elements, then either we need  a) a new type, or b) add some way to say elements may not be null…. As much as I do like b, I am leaning towards new type for this use case.

So, to flesh out the type requirements I have seen so far

1) represents a fixed size array of element type
* on write path we will need to validate this
2) element may not be null
* on write path we will need to validate this
3) “frozen” (is this really a requirement for the type or is this just simpler for the ANN work?  I feel that this shouldn’t be a requirement)
4) works for all types (my requirement; original proposal is float only, but could logically expand to primitive types)

Anything else?

The key thing about a vector is that unlike lists or tuples you really don't care about individual elements, you care about doing vector and matrix multiplications with the thing as a unit.

That may be true for this use case, but “should” this be true for the type itself?  I feel like no… if a user wants the Nth element of a vector why would we block them?  I am not saying the first patch, or even 5.0 adds support for index access, I am just trying to push back saying that the type should not block this.

(Maybe this is making the case for VECTOR FLOAT[N] rather than FLOAT VECTOR[N].)

Now that nulls are not allowed, I have mixed feelings about FLOAT[N], I prefer this syntax but that limitation may not be desired for all use cases… we could always add LIST and ARRAY later to address that case.

In terms of syntax I have seen, here is my ordered preference:

1) TYPE[size] - have mixed feelings due to non-null, but still prefer it
2) QUALIFIER TYPE[size] - QUALIFIER is just a Term we use to denote this semantic…. Could even be NON NULL TYPE[size]

On Apr 27, 2023, at 9:00 AM, Benedict wrote:

That’s a bounded ring buffer, not a fixed length array.

This definitely isn’t a tuple because the types are all the same, which is pretty crucial for matrix operations. Matrix libraries generally work on arrays of known dimensionality, or sparse representations.

Whether we draw any semantic link between the frozen list and whatever we do here, it is fundamentally a frozen 

Re: [DISCUSS] New data type for vector search

2023-04-28 Thread Jonathan Ellis
That's exactly right.

In particular it makes no sense at all from an ML perspective to have
vector types of anything other than numerics.  And as I mentioned in the
POC thread (but I did not mention here), float is overwhelmingly the most
frequently used vector type, to the point that Pinecone (by far the most
popular vector search engine) ONLY supports that type.

Lucene and Elastic also add support for vectors of bytes (8-bit ints),
which are useful for optimizing models that you have already built with
floats, but we have no reasonable path towards supporting indexing and
searches against any other vector type.
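
As a rough illustration of how byte vectors can stand in for vectors first produced as floats, here is a minimal Python sketch of symmetric scalar quantization. This is an assumption made for illustration only: it is not taken from Lucene, Elastic, or any Cassandra patch, and the names quantize_to_int8 and dequantize are hypothetical.

```python
def quantize_to_int8(vector):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    if not vector:
        return [], 1.0
    max_abs = max(abs(x) for x in vector)
    if max_abs == 0.0:
        # An all-zero vector needs no scaling.
        return [0] * len(vector), 1.0
    scale = max_abs / 127.0
    # Clamp after rounding so floating-point error cannot leave the range.
    quantized = [max(-127, min(127, round(x / scale))) for x in vector]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the quantized integers."""
    return [q * scale for q in quantized]


embedding = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize_to_int8(embedding)
restored = dequantize(codes, scale)
```

Each restored component differs from the original by at most about half of the scale factor, which is the precision traded away for the smaller element size.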

So in order of what makes sense to me:

1. Add a vector type for just floats; consider adding bytes later if demand
materializes. This gives us 99% of the value and limits the scope so we can
deliver quickly.

2. Add a vector type for floats or bytes. This gives us another 1% of value
in exchange for an extra 20% or so of effort.

3. Add a vector type for all numeric primitives, but you can only index
floats and bytes.  I think this is confusing to users and a bad idea.

4. Add a vector type that composes with all Cassandra types.  I can't see a
reason to do this, nobody wants it, and we killed the most similar proposal
in the past as wontfix.

On Thu, Apr 27, 2023 at 7:49 PM Josh McKenzie  wrote:

> From a machine learning perspective, vectors are a well-known concept that
> are effectively immutable fixed-length n-dimensional values that are then
> later used either as part of a model or in conjunction with a model after
> the fact.
>
> While we could have this be non-frozen and not call it a vector, I'd be
> inclined to still make the argument for a layer of syntactic sugar on top
> that met ML users where they were with concepts they understood rather than
> forcing them through the cognitive lift of figuring out the Cassandra
> specific contortions to replicate something that's ubiquitous in their
> space. We did the same "Cassandra-first" approach with our JSON support and
> that didn't do us any favors in terms of adoption and usage as far as I
> know.
>
> So is the goal here to provide something specific and idiomatic for the ML
> community or is the goal to make a primitive that's C*-centric that then
> another layer can write to? I personally argue for the former; I don't see
> this specific data type going away any time soon.
>
> On Thu, Apr 27, 2023, at 12:39 PM, David Capwell wrote:
>
> but as you point out it has the problem of allowing nulls.
>
>
> If nulls are not allowed for the elements, then either we need  a) a new
> type, or b) add some way to say elements may not be null…. As much as I do
> like b, I am leaning towards new type for this use case.
>
> So, to flesh out the type requirements I have seen so far
>
> 1) represents a fixed size array of element type
> * on write path we will need to validate this
> 2) element may not be null
> * on write path we will need to validate this
> 3) “frozen” (is this really a requirement for the type or is this
> just simpler for the ANN work?  I feel that this shouldn’t be a requirement)
> 4) works for all types (my requirement; original proposal is float only,
> but could logically expand to primitive types)
>
> Anything else?
>
> The key thing about a vector is that unlike lists or tuples you really
> don't care about individual elements, you care about doing vector and
> matrix multiplications with the thing as a unit.
>
>
> That may be true for this use case, but “should” this be true for the type
> itself?  I feel like no… if a user wants the Nth element of a vector why
> would we block them?  I am not saying the first patch, or even 5.0 adds
> support for index access, I am just trying to push back saying that the
> type should not block this.
>
> (Maybe this is making the case for VECTOR FLOAT[N] rather than FLOAT
> VECTOR[N].)
>
>
> Now that nulls are not allowed, I have mixed feelings about FLOAT[N], I
> prefer this syntax but that limitation may not be desired for all use
> cases… we could always add LIST and ARRAY later
> to address that case.
>
> In terms of syntax I have seen, here is my ordered preference:
>
> 1) TYPE[size] - have mixed feelings due to non-null, but still prefer it
> 2) QUALIFIER TYPE[size] - QUALIFIER is just a Term we use to denote this
> semantic…. Could even be NON NULL TYPE[size]
>
> On Apr 27, 2023, at 9:00 AM, Benedict  wrote:
>
>
> That’s a bounded ring buffer, not a fixed length array.
>
> This definitely isn’t a tuple because the types are all the same, which is
> pretty crucial for matrix operations. Matrix libraries generally work on
> arrays of known dimensionality, or sparse representations.
>
> Whether we draw any semantic link between the frozen list and whatever we
> do here, it is fundamentally a frozen list with a restriction on its size.
> What we’re defining here are “statically” sized arrays, whereas a frozen
> list is essentially a dynamically sized array.
>
> I do not think vector 

Re: [DISCUSS] New data type for vector search

2023-04-28 Thread Benedict
This feature may be targeting ML users but it isn’t part of some “ML plug-in”; it’s a general purpose type available to all users that happens to permit the use of ANN. So it needs to make sense in a general context, not just to ML users.

I also doubt users will struggle with understanding an array or similar type, but vector isn’t a hill I’m going to die on.

Frozen, though, should be a requirement for this type and thereby implied by whatever name or syntax we conjure up. Implementing a fixed-size non-frozen type is much more complex, perhaps even nonsensical.

Otherwise I broadly agree with David, and maybe lean towards stipulating nullability explicitly.

On 28 Apr 2023, at 01:49, Josh McKenzie wrote:

From a machine learning perspective, vectors are a well-known concept that are effectively immutable fixed-length n-dimensional values that are then later used either as part of a model or in conjunction with a model after the fact.

While we could have this be non-frozen and not call it a vector, I'd be inclined to still make the argument for a layer of syntactic sugar on top that met ML users where they were with concepts they understood rather than forcing them through the cognitive lift of figuring out the Cassandra specific contortions to replicate something that's ubiquitous in their space. We did the same "Cassandra-first" approach with our JSON support and that didn't do us any favors in terms of adoption and usage as far as I know.

So is the goal here to provide something specific and idiomatic for the ML community or is the goal to make a primitive that's C*-centric that then another layer can write to? I personally argue for the former; I don't see this specific data type going away any time soon.

On Thu, Apr 27, 2023, at 12:39 PM, David Capwell wrote:

but as you point out it has the problem of allowing nulls.

If nulls are not allowed for the elements, then either we need  a) a new type, or b) add some way to say elements may not be null…. As much as I do like b, I am leaning towards new type for this use case.

So, to flesh out the type requirements I have seen so far

1) represents a fixed size array of element type
* on write path we will need to validate this
2) element may not be null
* on write path we will need to validate this
3) “frozen” (is this really a requirement for the type or is this just simpler for the ANN work?  I feel that this shouldn’t be a requirement)
4) works for all types (my requirement; original proposal is float only, but could logically expand to primitive types)

Anything else?

The key thing about a vector is that unlike lists or tuples you really don't care about individual elements, you care about doing vector and matrix multiplications with the thing as a unit.

That may be true for this use case, but “should” this be true for the type itself?  I feel like no… if a user wants the Nth element of a vector why would we block them?  I am not saying the first patch, or even 5.0 adds support for index access, I am just trying to push back saying that the type should not block this.

(Maybe this is making the case for VECTOR FLOAT[N] rather than FLOAT VECTOR[N].)

Now that nulls are not allowed, I have mixed feelings about FLOAT[N], I prefer this syntax but that limitation may not be desired for all use cases… we could always add LIST and ARRAY later to address that case.

In terms of syntax I have seen, here is my ordered preference:

1) TYPE[size] - have mixed feelings due to non-null, but still prefer it
2) QUALIFIER TYPE[size] - QUALIFIER is just a Term we use to denote this semantic…. Could even be NON NULL TYPE[size]

On Apr 27, 2023, at 9:00 AM, Benedict wrote:

That’s a bounded ring buffer, not a fixed length array.

This definitely isn’t a tuple because the types are all the same, which is pretty crucial for matrix operations. Matrix libraries generally work on arrays of known dimensionality, or sparse representations.

Whether we draw any semantic link between the frozen list and whatever we do here, it is fundamentally a frozen list with a restriction on its size. What we’re defining here are “statically” sized arrays, whereas a frozen list is essentially a dynamically sized array.

I do not think vector is a good name because vector is used in some other popular languages to mean a (dynamic) list, which is confusing when we also have a list concept.

I’m fine with just using the FLOAT[N] syntax, and drawing no direct link with list. Though it is a bit strange that this particular type declaration looks so different to other collection types.

On 27 Apr 2023, at 16:48, Jeff Jirsa wrote:

On Thu, Apr 27, 2023 at 7:39 AM Jonathan Ellis wrote:

It's been a while, so I may be missing something, but do we already have fixed-size lists?  If not, I don't see why we'd try to make this fit into a List-shaped problem.

We do not. The proposal got closed as wont-fix https://issues.apache.org/jira/browse/CASSANDRA-9110

Re: [DISCUSS] New data type for vector search

2023-04-27 Thread steve landiss via dev
 
+1

On Thursday, April 27, 2023 at 07:36:19 PM PDT, Caleb Rackliffe wrote:
 
 I don’t have a lot to add here, other than to say I’m broadly in agreement w/ 
David on syntax preference, element selectability, and making this a new type 
that roughly corresponds to a primitive (non-null-allowing) array.


On Apr 27, 2023, at 9:18 PM, Anthony Grasso  wrote:



It would be strange for this declaration to look different from other 
collection types. We may want to reconsider using the collection syntax. I also 
like the idea of the vector dimensions being declared with the VECTOR keyword. 
An alternative syntax option to explore is:
VECTOR[size]
On Fri, 28 Apr 2023 at 10:49, Josh McKenzie  wrote:

From a machine learning perspective, vectors are a well-known concept that are 
effectively immutable fixed-length n-dimensional values that are then later 
used either as part of a model or in conjunction with a model after the fact.

While we could have this be non-frozen and not call it a vector, I'd be 
inclined to still make the argument for a layer of syntactic sugar on top that 
met ML users where they were with concepts they understood rather than forcing 
them through the cognitive lift of figuring out the Cassandra specific 
contortions to replicate something that's ubiquitous in their space. We did the 
same "Cassandra-first" approach with our JSON support and that didn't do us any 
favors in terms of adoption and usage as far as I know.

So is the goal here to provide something specific and idiomatic for the ML 
community or is the goal to make a primitive that's C*-centric that then 
another layer can write to? I personally argue for the former; I don't see this 
specific data type going away any time soon.
On Thu, Apr 27, 2023, at 12:39 PM, David Capwell wrote:


but as you point out it has the problem of allowing nulls.


If nulls are not allowed for the elements, then either we need  a) a new type, 
or b) add some way to say elements may not be null…. As much as I do like b, I 
am leaning towards new type for this use case.

So, to flesh out the type requirements I have seen so far

1) represents a fixed size array of element type
* on write path we will need to validate this
2) element may not be null
* on write path we will need to validate this
3) “frozen” (is this really a requirement for the type or is this just simpler 
for the ANN work?  I feel that this shouldn’t be a requirement)
4) works for all types (my requirement; original proposal is float only, but 
could logically expand to primitive types)

Anything else?


The key thing about a vector is that unlike lists or tuples you really don't 
care about individual elements, you care about doing vector and matrix 
multiplications with the thing as a unit. 


That may be true for this use case, but “should” this be true for the type 
itself?  I feel like no… if a user wants the Nth element of a vector why would 
we block them?  I am not saying the first patch, or even 5.0 adds support for 
index access, I am just trying to push back saying that the type should not 
block this.
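
The two views in this exchange are not mutually exclusive, as the following minimal Python sketch tries to show. It treats a vector as a unit for a similarity computation and also reads a single element by index. The helper names dot and cosine_similarity are hypothetical and chosen for illustration; nothing here reflects the proposed CQL syntax or any Cassandra API.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot(a, b) / (norm_a * norm_b)


query = (1.0, 0.0, 0.0)
stored = (0.0, 1.0, 0.0)

# Whole-vector operation: the value is used as a unit.
similarity = cosine_similarity(query, stored)

# Element access: reading the Nth component of the same value.
second_component = stored[1]
```

Supporting the first kind of operation does not require forbidding the second, which is the point being argued above.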


(Maybe this is making the case for VECTOR FLOAT[N] rather than FLOAT VECTOR[N].)


Now that nulls are not allowed, I have mixed feelings about FLOAT[N], I prefer 
this syntax but that limitation may not be desired for all use cases… we could 
always add LIST and ARRAY later to address that case.

In terms of syntax I have seen, here is my ordered preference:

1) TYPE[size] - have mixed feelings due to non-null, but still prefer it
2) QUALIFIER TYPE[size] - QUALIFIER is just a Term we use to denote this 
semantic…. Could even be NON NULL TYPE[size]


On Apr 27, 2023, at 9:00 AM, Benedict  wrote:


That’s a bounded ring buffer, not a fixed length array.

This definitely isn’t a tuple because the types are all the same, which is 
pretty crucial for matrix operations. Matrix libraries generally work on arrays 
of known dimensionality, or sparse representations.

Whether we draw any semantic link between the frozen list and whatever we do 
here, it is fundamentally a frozen list with a restriction on its size. What 
we’re defining here are “statically” sized arrays, whereas a frozen list is 
essentially a dynamically sized array.
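
The distinction between a frozen list and a statically sized array can be sketched with two hypothetical type descriptors in Python. The class names FrozenListType and FixedArrayType are assumptions made for illustration and do not correspond to any Cassandra class. The only difference between them is that the size is part of the fixed array's type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrozenListType:
    """Immutable list: any length is acceptable."""
    element_type: type

    def accepts(self, value):
        return isinstance(value, tuple) and all(
            isinstance(v, self.element_type) for v in value
        )


@dataclass(frozen=True)
class FixedArrayType:
    """Immutable array whose length is fixed by the type itself."""
    element_type: type
    size: int

    def accepts(self, value):
        return (
            isinstance(value, tuple)
            and len(value) == self.size
            and all(isinstance(v, self.element_type) for v in value)
        )


frozen_floats = FrozenListType(float)
float_pair = FixedArrayType(float, 2)
float_triple = FixedArrayType(float, 3)
```

Under this sketch float_pair and float_triple are different types, while frozen_floats accepts values of either length.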

I do not think vector is a good name because vector is used in some other 
popular languages to mean a (dynamic) list, which is confusing when we also 
have a list concept.

I’m fine with just using the FLOAT[N] syntax, and drawing no direct link with 
list. Though it is a bit strange that this particular type declaration looks so 
different to other collection types.


On 27 Apr 2023, at 16:48, Jeff Jirsa  wrote:





On Thu, Apr 27, 2023 at 7:39 AM Jonathan Ellis  wrote:

It's been a while, so I may be missing something, but do we already have 
fixed-size lists?  If not, I don't see why we'd try to make this fit into a 
List-shaped problem.


We do not. The proposal got closed as wont-fix  

Re: [DISCUSS] New data type for vector search

2023-04-27 Thread Caleb Rackliffe
I don’t have a lot to add here, other than to say I’m broadly in agreement w/ David on syntax preference, element selectability, and making this a new type that roughly corresponds to a primitive (non-null-allowing) array.

On Apr 27, 2023, at 9:18 PM, Anthony Grasso wrote:

It would be strange for this declaration to look different from other collection types. We may want to reconsider using the collection syntax. I also like the idea of the vector dimensions being declared with the VECTOR keyword. An alternative syntax option to explore is:

VECTOR[size]

On Fri, 28 Apr 2023 at 10:49, Josh McKenzie wrote:

From a machine learning perspective, vectors are a well-known concept that are effectively immutable fixed-length n-dimensional values that are then later used either as part of a model or in conjunction with a model after the fact.

While we could have this be non-frozen and not call it a vector, I'd be inclined to still make the argument for a layer of syntactic sugar on top that met ML users where they were with concepts they understood rather than forcing them through the cognitive lift of figuring out the Cassandra specific contortions to replicate something that's ubiquitous in their space. We did the same "Cassandra-first" approach with our JSON support and that didn't do us any favors in terms of adoption and usage as far as I know.

So is the goal here to provide something specific and idiomatic for the ML community or is the goal to make a primitive that's C*-centric that then another layer can write to? I personally argue for the former; I don't see this specific data type going away any time soon.

On Thu, Apr 27, 2023, at 12:39 PM, David Capwell wrote:

but as you point out it has the problem of allowing nulls.

If nulls are not allowed for the elements, then either we need  a) a new type, or b) add some way to say elements may not be null…. As much as I do like b, I am leaning towards new type for this use case.

So, to flesh out the type requirements I have seen so far

1) represents a fixed size array of element type
* on write path we will need to validate this
2) element may not be null
* on write path we will need to validate this
3) “frozen” (is this really a requirement for the type or is this just simpler for the ANN work?  I feel that this shouldn’t be a requirement)
4) works for all types (my requirement; original proposal is float only, but could logically expand to primitive types)

Anything else?

The key thing about a vector is that unlike lists or tuples you really don't care about individual elements, you care about doing vector and matrix multiplications with the thing as a unit.

That may be true for this use case, but “should” this be true for the type itself?  I feel like no… if a user wants the Nth element of a vector why would we block them?  I am not saying the first patch, or even 5.0 adds support for index access, I am just trying to push back saying that the type should not block this.

(Maybe this is making the case for VECTOR FLOAT[N] rather than FLOAT VECTOR[N].)

Now that nulls are not allowed, I have mixed feelings about FLOAT[N], I prefer this syntax but that limitation may not be desired for all use cases… we could always add LIST and ARRAY later to address that case.

In terms of syntax I have seen, here is my ordered preference:

1) TYPE[size] - have mixed feelings due to non-null, but still prefer it
2) QUALIFIER TYPE[size] - QUALIFIER is just a Term we use to denote this semantic…. Could even be NON NULL TYPE[size]

On Apr 27, 2023, at 9:00 AM, Benedict wrote:

That’s a bounded ring buffer, not a fixed length array.

This definitely isn’t a tuple because the types are all the same, which is pretty crucial for matrix operations. Matrix libraries generally work on arrays of known dimensionality, or sparse representations.

Whether we draw any semantic link between the frozen list and whatever we do here, it is fundamentally a frozen list with a restriction on its size. What we’re defining here are “statically” sized arrays, whereas a frozen list is essentially a dynamically sized array.

I do not think vector is a good name because vector is used in some other popular languages to mean a (dynamic) list, which is confusing when we also have a list concept.

I’m fine with just using the FLOAT[N] syntax, and drawing no direct link with list. Though it is a bit strange that this particular type declaration looks so different to other collection types.

On 27 Apr 2023, at 16:48, Jeff Jirsa wrote:

On Thu, Apr 27, 2023 at 7:39 AM Jonathan Ellis wrote:

It's been a while, so I may be missing something, but do we already have fixed-size lists?  If not, I don't see why we'd try to make this fit into a List-shaped problem.

We do not. The proposal got closed as wont-fix https://issues.apache.org/jira/browse/CASSANDRA-9110


Re: [DISCUSS] New data type for vector search

2023-04-27 Thread Anthony Grasso
It would be strange for this declaration to look different from other
collection types. We may want to reconsider using the collection syntax. I
also like the idea of the vector dimensions being declared with the VECTOR
keyword. An alternative syntax option to explore is:

VECTOR[size]

On Fri, 28 Apr 2023 at 10:49, Josh McKenzie  wrote:

> From a machine learning perspective, vectors are a well-known concept that
> are effectively immutable fixed-length n-dimensional values that are then
> later used either as part of a model or in conjunction with a model after
> the fact.
>
> While we could have this be non-frozen and not call it a vector, I'd be
> inclined to still make the argument for a layer of syntactic sugar on top
> that met ML users where they were with concepts they understood rather than
> forcing them through the cognitive lift of figuring out the Cassandra
> specific contortions to replicate something that's ubiquitous in their
> space. We did the same "Cassandra-first" approach with our JSON support and
> that didn't do us any favors in terms of adoption and usage as far as I
> know.
>
> So is the goal here to provide something specific and idiomatic for the ML
> community or is the goal to make a primitive that's C*-centric that then
> another layer can write to? I personally argue for the former; I don't see
> this specific data type going away any time soon.
>
> On Thu, Apr 27, 2023, at 12:39 PM, David Capwell wrote:
>
> but as you point out it has the problem of allowing nulls.
>
>
> If nulls are not allowed for the elements, then either we need  a) a new
> type, or b) add some way to say elements may not be null…. As much as I do
> like b, I am leaning towards new type for this use case.
>
> So, to flesh out the type requirements I have seen so far
>
> 1) represents a fixed size array of element type
> * on write path we will need to validate this
> 2) element may not be null
> * on write path we will need to validate this
> 3) “frozen” (is this really a requirement for the type or is this
> just simpler for the ANN work?  I feel that this shouldn’t be a requirement)
> 4) works for all types (my requirement; original proposal is float only,
> but could logically expand to primitive types)
>
> Anything else?
>
> The key thing about a vector is that unlike lists or tuples you really
> don't care about individual elements, you care about doing vector and
> matrix multiplications with the thing as a unit.
>
>
> That may be true for this use case, but “should” this be true for the type
> itself?  I feel like no… if a user wants the Nth element of a vector why
> would we block them?  I am not saying the first patch, or even 5.0 adds
> support for index access, I am just trying to push back saying that the
> type should not block this.
>
> (Maybe this is making the case for VECTOR FLOAT[N] rather than FLOAT
> VECTOR[N].)
>
>
> Now that nulls are not allowed, I have mixed feelings about FLOAT[N], I
> prefer this syntax but that limitation may not be desired for all use
> cases… we could always add LIST and ARRAY later
> to address that case.
>
> In terms of syntax I have seen, here is my ordered preference:
>
> 1) TYPE[size] - have mixed feelings due to non-null, but still prefer it
> 2) QUALIFIER TYPE[size] - QUALIFIER is just a Term we use to denote this
> semantic…. Could even be NON NULL TYPE[size]
>
> On Apr 27, 2023, at 9:00 AM, Benedict  wrote:
>
>
> That’s a bounded ring buffer, not a fixed length array.
>
> This definitely isn’t a tuple because the types are all the same, which is
> pretty crucial for matrix operations. Matrix libraries generally work on
> arrays of known dimensionality, or sparse representations.
>
> Whether we draw any semantic link between the frozen list and whatever we
> do here, it is fundamentally a frozen list with a restriction on its size.
> What we’re defining here are “statically” sized arrays, whereas a frozen
> list is essentially a dynamically sized array.
>
> I do not think vector is a good name because vector is used in some other
> popular languages to mean a (dynamic) list, which is confusing when we also
> have a list concept.
>
> I’m fine with just using the FLOAT[N] syntax, and drawing no direct link
> with list. Though it is a bit strange that this particular type declaration
> looks so different to other collection types.
>
> On 27 Apr 2023, at 16:48, Jeff Jirsa  wrote:
>
> 
>
>
> On Thu, Apr 27, 2023 at 7:39 AM Jonathan Ellis  wrote:
>
> It's been a while, so I may be missing something, but do we already have
> fixed-size lists?  If not, I don't see why we'd try to make this fit into a
> List-shaped problem.
>
>
> We do not. The proposal got closed as wont-fix
> https://issues.apache.org/jira/browse/CASSANDRA-9110
>
>
>
>


Re: [DISCUSS] New data type for vector search

2023-04-27 Thread Josh McKenzie
From a machine learning perspective, vectors are a well-known concept that are 
effectively immutable fixed-length n-dimensional values that are then later 
used either as part of a model or in conjunction with a model after the fact.

While we could have this be non-frozen and not call it a vector, I'd be 
inclined to still make the argument for a layer of syntactic sugar on top that 
met ML users where they were with concepts they understood rather than forcing 
them through the cognitive lift of figuring out the Cassandra specific 
contortions to replicate something that's ubiquitous in their space. We did the 
same "Cassandra-first" approach with our JSON support and that didn't do us any 
favors in terms of adoption and usage as far as I know.

So is the goal here to provide something specific and idiomatic for the ML 
community or is the goal to make a primitive that's C*-centric that then 
another layer can write to? I personally argue for the former; I don't see this 
specific data type going away any time soon.

On Thu, Apr 27, 2023, at 12:39 PM, David Capwell wrote:
>> but as you point out it has the problem of allowing nulls.
> 
> If nulls are not allowed for the elements, then either we need  a) a new 
> type, or b) add some way to say elements may not be null…. As much as I do 
> like b, I am leaning towards new type for this use case.
> 
> So, to flesh out the type requirements I have seen so far
> 
> 1) represents a fixed size array of element type
> * on write path we will need to validate this
> 2) element may not be null
> * on write path we will need to validate this
> 3) “frozen” (is this really a requirement for the type or is this just 
> simpler for the ANN work?  I feel that this shouldn’t be a requirement)
> 4) works for all types (my requirement; original proposal is float only, but 
> could logically expand to primitive types)
> 
> Anything else?
> 
>> The key thing about a vector is that unlike lists or tuples you really don't 
>> care about individual elements, you care about doing vector and matrix 
>> multiplications with the thing as a unit. 
> 
> That may be true for this use case, but “should” this be true for the type 
> itself?  I feel like no… if a user wants the Nth element of a vector why 
> would we block them?  I am not saying the first patch, or even 5.0 adds 
> support for index access, I am just trying to push back saying that the type 
> should not block this.
> 
>> (Maybe this is making the case for VECTOR FLOAT[N] rather than FLOAT 
>> VECTOR[N].)
> 
> Now that nulls are not allowed, I have mixed feelings about FLOAT[N], I 
> prefer this syntax but that limitation may not be desired for all use cases… 
> we could always add LIST and ARRAY later to address that 
> case.
> 
> In terms of syntax I have seen, here is my ordered preference:
> 
> 1) TYPE[size] - have mixed feelings due to non-null, but still prefer it
> 2) QUALIFIER TYPE[size] - QUALIFIER is just a Term we use to denote this 
> semantic…. Could even be NON NULL TYPE[size]
> 
>> On Apr 27, 2023, at 9:00 AM, Benedict  wrote:
>> 
>> 
>> That’s a bounded ring buffer, not a fixed length array.
>> 
>> This definitely isn’t a tuple because the types are all the same, which is 
>> pretty crucial for matrix operations. Matrix libraries generally work on 
>> arrays of known dimensionality, or sparse representations.
>> 
>> Whether we draw any semantic link between the frozen list and whatever we do 
>> here, it is fundamentally a frozen list with a restriction on its size. What 
>> we’re defining here are “statically” sized arrays, whereas a frozen list is 
>> essentially a dynamically sized array.
>> 
>> I do not think vector is a good name because vector is used in some other 
>> popular languages to mean a (dynamic) list, which is confusing when we also 
>> have a list concept.
>> 
>> I’m fine with just using the FLOAT[N] syntax, and drawing no direct link 
>> with list. Though it is a bit strange that this particular type declaration 
>> looks so different to other collection types.
>> 
>>> On 27 Apr 2023, at 16:48, Jeff Jirsa  wrote:
>>> 
>>> 
>>> 
>>> On Thu, Apr 27, 2023 at 7:39 AM Jonathan Ellis  wrote:
>>>> It's been a while, so I may be missing something, but do we already have 
>>>> fixed-size lists?  If not, I don't see why we'd try to make this fit into 
>>>> a List-shaped problem.
>>> 
>>> We do not. The proposal got closed as wont-fix  
>>> https://issues.apache.org/jira/browse/CASSANDRA-9110
>>> 
>>> 


Re: [DISCUSS] New data type for vector search

2023-04-27 Thread David Capwell
> but as you point out it has the problem of allowing nulls.

If nulls are not allowed for the elements, then we need either a) a new type, 
or b) some way to say elements may not be null…. As much as I like b, I am 
leaning towards a new type for this use case.

So, to flesh out the type requirements I have seen so far

1) represents a fixed size array of element type
* on write path we will need to validate this
2) element may not be null
* on write path we will need to validate this
3) “frozen” (is this really a requirement for the type or is this just simpler 
for the ANN work?  I feel that this shouldn’t be a requirement)
4) works for all types (my requirement; original proposal is float only, but 
could logically expand to primitive types)

Anything else?
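
As a rough illustration of requirements 1 and 2 above, the write-path
validation comes down to two cheap checks. This is only a sketch in Python;
the name validate_fixed_array is hypothetical and nothing here reflects the
actual patch.

```python
from typing import Optional, Sequence


def validate_fixed_array(values: Sequence[Optional[float]], size: int) -> None:
    """Reject a write unless it carries exactly `size` non-null elements."""
    # Requirement 1: the array has a fixed, declared size.
    if len(values) != size:
        raise ValueError(f"expected {size} elements, got {len(values)}")
    # Requirement 2: no element may be null.
    for i, v in enumerate(values):
        if v is None:
            raise ValueError(f"element {i} is null")


# Accepted: three non-null elements for a declared size of 3.
validate_fixed_array([1.0, 2.0, 3.0], 3)
```

Both checks are linear in the declared size, so they should add little cost
to a write that already has to serialize every element.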

> The key thing about a vector is that unlike lists or tuples you really don't 
> care about individual elements, you care about doing vector and matrix 
> multiplications with the thing as a unit. 

That may be true for this use case, but “should” this be true for the type 
itself?  I feel like no… if a user wants the Nth element of a vector why would 
we block them?  I am not saying the first patch, or even 5.0 adds support for 
index access, I am just trying to push back saying that the type should not 
block this.

> (Maybe this is making the case for VECTOR FLOAT[N] rather than FLOAT 
> VECTOR[N].)

Now that nulls are not allowed, I have mixed feelings about FLOAT[N], I prefer 
this syntax but that limitation may not be desired for all use cases… we could 
always add LIST and ARRAY later to address that case.

In terms of syntax I have seen, here is my ordered preference:

1) TYPE[size] - have mixed feelings due to non-null, but still prefer it
2) QUALIFIER TYPE[size] - QUALIFIER is just a Term we use to denote this 
semantic…. Could even be NON NULL TYPE[size]

> On Apr 27, 2023, at 9:00 AM, Benedict  wrote:
> 
> That’s a bounded ring buffer, not a fixed length array.
> 
> This definitely isn’t a tuple because the types are all the same, which is 
> pretty crucial for matrix operations. Matrix libraries generally work on 
> arrays of known dimensionality, or sparse representations.
> 
> Whether we draw any semantic link between the frozen list and whatever we do 
> here, it is fundamentally a frozen list with a restriction on its size. What 
> we’re defining here are “statically” sized arrays, whereas a frozen list is 
> essentially a dynamically sized array.
> 
> I do not think vector is a good name because vector is used in some other 
> popular languages to mean a (dynamic) list, which is confusing when we also 
> have a list concept.
> 
> I’m fine with just using the FLOAT[N] syntax, and drawing no direct link with 
> list. Though it is a bit strange that this particular type declaration looks 
> so different to other collection types.
> 
>> On 27 Apr 2023, at 16:48, Jeff Jirsa  wrote:
>> 
>> 
>> 
>> 
>> On Thu, Apr 27, 2023 at 7:39 AM Jonathan Ellis  wrote:
>>> It's been a while, so I may be missing something, but do we already have 
>>> fixed-size lists?  If not, I don't see why we'd try to make this fit into a 
>>> List-shaped problem.
>> 
>> We do not. The proposal got closed as wont-fix  
>> https://issues.apache.org/jira/browse/CASSANDRA-9110
>> 
>> 



Re: [DISCUSS] New data type for vector search

2023-04-27 Thread Benedict
That’s a bounded ring buffer, not a fixed length array.

This definitely isn’t a tuple because the types are all the same, which is 
pretty crucial for matrix operations. Matrix libraries generally work on 
arrays of known dimensionality, or sparse representations.

Whether we draw any semantic link between the frozen list and whatever we do 
here, it is fundamentally a frozen list with a restriction on its size. What 
we’re defining here are “statically” sized arrays, whereas a frozen list is 
essentially a dynamically sized array.

I do not think vector is a good name because vector is used in some other 
popular languages to mean a (dynamic) list, which is confusing when we also 
have a list concept.

I’m fine with just using the FLOAT[N] syntax, and drawing no direct link with 
list. Though it is a bit strange that this particular type declaration looks 
so different to other collection types.

On 27 Apr 2023, at 16:48, Jeff Jirsa  wrote:

> On Thu, Apr 27, 2023 at 7:39 AM Jonathan Ellis  wrote:
>> It's been a while, so I may be missing something, but do we already have 
>> fixed-size lists?  If not, I don't see why we'd try to make this fit into a 
>> List-shaped problem.
>
> We do not. The proposal got closed as wont-fix  
> https://issues.apache.org/jira/browse/CASSANDRA-9110


Re: [DISCUSS] New data type for vector search

2023-04-27 Thread Jeff Jirsa
On Thu, Apr 27, 2023 at 7:39 AM Jonathan Ellis  wrote:

> It's been a while, so I may be missing something, but do we already have
> fixed-size lists?  If not, I don't see why we'd try to make this fit into a
> List-shaped problem.
>

We do not. The proposal got closed as wont-fix
https://issues.apache.org/jira/browse/CASSANDRA-9110


Re: [DISCUSS] New data type for vector search

2023-04-27 Thread Jonathan Ellis
It's been a while, so I may be missing something, but do we already have
fixed-size lists?  If not, I don't see why we'd try to make this fit into a
List-shaped problem.

A tuple would be a better fit from that perspective, but as you point out
it has the problem of allowing nulls.

The key thing about a vector is that unlike lists or tuples you really
don't care about individual elements, you care about doing vector and
matrix multiplications with the thing as a unit.  That's the key reason
that it makes more sense to me as a separate type.

(Maybe this is making the case for VECTOR FLOAT[N] rather than FLOAT
VECTOR[N].)


On Wed, Apr 26, 2023 at 4:31 PM Andrés de la Peña 
wrote:

> If we are going to use FLOAT[N] as sugar for another CQL data type, maybe
> tuples are more convenient than lists. So FLOAT[N] could be equivalent to
> TUPLE.
>
> Unlike collections, tuples have a fixed size, they are always
> frozen and I think they don't support random access. These properties seem
> desirable for vectors.
>
> Tuples however support null values, whereas collections don't. I mean,
> you can remove elements from a collection, but I think you are never going
> to see an explicit null in the collection. Tuples don't allow removing a
> value, but the entire tuple can be written with null values, as in INSERT
> INTO t (key, tuple) VALUES (0,  (1, null, 3)).
>
> On Wed, 26 Apr 2023 at 21:53, Mick Semb Wever  wrote:
>
>>> My inclination then would be to say you declare an ARRAY (which
>>> is semantic sugar for FROZEN>). This is very consistent with
>>> our existing style. We then simply permit such columns to define ANN
>>> indexes.
>>>
>>
>>
>> So long as nulls aren't a problem as David questions, an alternative is:
>>
>>  FLOAT[N] as semantic sugar for LIST
>>
>> And ANN requiring FROZEN
>>
>> Maybe taking a poll in a few days will be positive to keep this
>> moving forward.
>>
>

-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced


Re: [DISCUSS] New data type for vector search

2023-04-26 Thread Andrés de la Peña
If we are going to use FLOAT[N] as sugar for another CQL data type, maybe
tuples are more convenient than lists. So FLOAT[N] could be equivalent to
TUPLE.

Unlike collections, tuples have a fixed size, they are always
frozen and I think they don't support random access. These properties seem
desirable for vectors.

Tuples however support null values, whereas collections don't. I mean,
you can remove elements from a collection, but I think you are never going
to see an explicit null in the collection. Tuples don't allow removing a
value, but the entire tuple can be written with null values, as in INSERT
INTO t (key, tuple) VALUES (0,  (1, null, 3)).

On Wed, 26 Apr 2023 at 21:53, Mick Semb Wever  wrote:

>> My inclination then would be to say you declare an ARRAY (which
>> is semantic sugar for FROZEN>). This is very consistent with
>> our existing style. We then simply permit such columns to define ANN
>> indexes.
>>
>
>
> So long as nulls aren't a problem as David questions, an alternative is:
>
>  FLOAT[N] as semantic sugar for LIST
>
> And ANN requiring FROZEN
>
> Maybe taking a poll in a few days will be positive to keep this
> moving forward.
>


Re: [DISCUSS] New data type for vector search

2023-04-26 Thread Mick Semb Wever
>
> My inclination then would be to say you declare an ARRAY (which
> is semantic sugar for FROZEN>). This is very consistent with
> our existing style. We then simply permit such columns to define ANN
> indexes.
>


So long as nulls aren't a problem as David questions, an alternative is:

 FLOAT[N] as semantic sugar for LIST

And ANN requiring FROZEN

Maybe taking a poll in a few days will be positive to keep this
moving forward.


Re: [DISCUSS] New data type for vector search

2023-04-26 Thread David Capwell
Benedict's comments also make me question: can any of the values in the vector 
be null?  The patch sent works with float arrays, so null isn’t possible… is 
null not valid for a vector type?  If so, this would help justify why a 
vector is not an array or a list (both allow null)

> On Apr 26, 2023, at 10:50 AM, David Capwell  wrote:
> 
> Thanks for starting this thread!
> 
>> In the initial commits and thread, this was DENSE FLOAT32. Nobody really 
>> loved that, so we considered a bunch of alternatives, including
>> 
>> - `FLOAT[N]`: This minimal option resembles C and Java array syntax, which 
>> would make it familiar for many users. However, this syntax raises the 
>> question of why arrays cannot be created for other types.  Additionally, the 
>> expectation for an array is to provide random access to its contents, which 
>> is not supported for vectors.
>> - `DENSE FLOAT[N]`: This option clarifies that we are supporting dense 
>> vectors, not sparse ones. However, since Lucene had sparse vector support in 
>> the past but removed it for lack of compelling use cases, it is unlikely 
>> that it will be added back, making the "DENSE" qualifier less relevant.
>> - `DENSE FLOAT VECTOR[N]`: This is the most verbose option and aligns with 
>> the CQL/SQL spirit. However, the "DENSE" qualifier is unnecessary for the 
>> reasons mentioned above.
>> - `VECTOR FLOAT[N]`: This option omits the "DENSE" qualifier, but has a less 
>> natural word order.
>> - `VECTOR`: This follows the syntax of our Collections, but again 
>> this would imply that random access is supported, which we want to avoid 
>> doing.
>> - `VECTOR[N]`: This syntax is not very clear about the vector's contents and 
>> could make it difficult to add other vector types, such as byte vectors 
>> (already supported by Lucene), in the future.
> 
> I didn’t look close enough when I saw your patch, is this type multicell or 
> not?  Aka is this acting like a frozen> of fixed size?  I had 
> assumed it's non-multicell…. Main reason I ask this now is this pushback for 
> random access…. Let's say I have the following table
> 
> CREATE TABLE fluffy_kittens (
>   pk int PRIMARY KEY,
>   vector FLOAT[42] -- don’t ask why fluffy kittens need a vector, they just do!
> );
> 
> If I do the following query, I would expect it to work
> 
> SELECT vector[7] FROM fluffy_kittens WHERE pk=0; -- 7 is less than 42
> 
> While working on Accord’s CQL integration, Caleb and I kept getting bitten by 
> frozen vs non-frozen behavior: so many cases just stop working on frozen 
> collections, yet they should be easy to add (we already force the user to load 
> the full value, so why can we not touch it?).
> 
> Now, back to the random access comment, assuming this is not multicell why 
> would random access be blocked?  If the type isValueLengthFixed() == true 
> then random access should be simple (else it does require walking the array 
> in-order or to fully deserialize the BB (if working with Lucene I assume we 
> already deserialized out of BB)).  I am just trying to flesh out if there is 
> a limitation not being brought up or is this trying to limit the scope of 
> access for easier testing?
> 
>> However, this syntax raises the question of why arrays cannot be created for 
>> other types
> 
> Left this comment in the other thread, why not?  This could be useful outside 
> the float use case, so having a new "VectorType(AbstractType elements, int 
> size)” is easier/better than a float only version.  I also did a lot of work 
> to fuzz test our type system, so just adding that into the existing generator 
> would get good coverage right off the bat (have another fuzz tester I have 
> not contributed yet, it was done for Accord… it fuzz tests the AST, so would 
> be easy to add this there, that would test type specific access, which the 
> existing tests don’t)
> 
>> Finally, the original qualifier of 32 in `FLOAT32` was intended to allow 
>> consistency if we add other float types like FLOAT16 or FLOAT64
> 
> I do not think we should add a new FLOAT32 type, but I am cool with an alias 
> that has FLOAT32 point to FLOAT.  One negative of this is that the code paths 
> where we return schema back to users would do FLOAT even if user wrote 
> FLOAT32… other than that negative I don’t see any other problems.
> 
>> Thus, we believe that `FLOAT VECTOR[N_DIMENSIONS]` provides the best balance 
>> of clarity, conciseness, and extensibility. It is more natural in its word 
>> order than the original proposal and avoids unnecessary qualifiers, while 
>> still being clear about the data type it represents. Finally, this syntax is 
>> straightforwardly extensible should we choose to support other vector types 
>> in the future.
> 
> My preference is TYPE[n_dimension] but I am ok with this syntax if others 
> prefer it.  I don’t agree that this extra verbosity adds more clarity, there 
> seems to be an assumption that this will tell users that random access isn’t 
> allowed and only blessed types 

Re: [DISCUSS] New data type for vector search

2023-04-26 Thread David Capwell
Thanks for starting this thread!

> In the initial commits and thread, this was DENSE FLOAT32. Nobody really 
> loved that, so we considered a bunch of alternatives, including
> 
> - `FLOAT[N]`: This minimal option resembles C and Java array syntax, which 
> would make it familiar for many users. However, this syntax raises the 
> question of why arrays cannot be created for other types.  Additionally, the 
> expectation for an array is to provide random access to its contents, which 
> is not supported for vectors.
> - `DENSE FLOAT[N]`: This option clarifies that we are supporting dense 
> vectors, not sparse ones. However, since Lucene had sparse vector support in 
> the past but removed it for lack of compelling use cases, it is unlikely that 
> it will be added back, making the "DENSE" qualifier less relevant.
> - `DENSE FLOAT VECTOR[N]`: This is the most verbose option and aligns with 
> the CQL/SQL spirit. However, the "DENSE" qualifier is unnecessary for the 
> reasons mentioned above.
> - `VECTOR FLOAT[N]`: This option omits the "DENSE" qualifier, but has a less 
> natural word order.
> - `VECTOR`: This follows the syntax of our Collections, but again 
> this would imply that random access is supported, which we want to avoid 
> doing.
> - `VECTOR[N]`: This syntax is not very clear about the vector's contents and 
> could make it difficult to add other vector types, such as byte vectors 
> (already supported by Lucene), in the future.

I didn’t look close enough when I saw your patch, is this type multicell or 
not?  Aka is this acting like a frozen> of fixed size?  I had 
assumed it's non-multicell…. Main reason I ask this now is this pushback for 
random access…. Let's say I have the following table

CREATE TABLE fluffy_kittens (
  pk int PRIMARY KEY,
  vector FLOAT[42] -- don’t ask why fluffy kittens need a vector, they just do!
);

If I do the following query, I would expect it to work

SELECT vector[7] FROM fluffy_kittens WHERE pk=0; -- 7 is less than 42

While working on Accord’s CQL integration, Caleb and I kept getting bitten by 
frozen vs non-frozen behavior: so many cases just stop working on frozen 
collections, yet they should be easy to add (we already force the user to load 
the full value, so why can we not touch it?).

Now, back to the random access comment, assuming this is not multicell why 
would random access be blocked?  If the type isValueLengthFixed() == true then 
random access should be simple (else it does require walking the array in-order 
or to fully deserialize the BB (if working with Lucene I assume we already 
deserialized out of BB)).  I am just trying to flesh out if there is a 
limitation not being brought up or is this trying to limit the scope of access 
for easier testing?
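
To illustrate the fixed-length case: if every element serializes to the same
number of bytes, the Nth element sits at a computable offset and no walk over
the array is needed. A minimal sketch in Python, assuming a hypothetical
layout of big-endian 32-bit floats packed back to back with no per-element
length prefix (not necessarily what the patch does); serialize and element_at
are illustrative names only.

```python
import struct

FLOAT_SIZE = 4  # bytes per 32-bit float in the assumed layout


def serialize(values):
    """Pack the floats back to back, big-endian, with no length prefixes."""
    return struct.pack(f">{len(values)}f", *values)


def element_at(buf: bytes, index: int, size: int) -> float:
    """Read element `index` directly by offset, without decoding the rest."""
    if not 0 <= index < size:
        raise IndexError(f"index {index} out of range for size {size}")
    return struct.unpack_from(">f", buf, index * FLOAT_SIZE)[0]


buf = serialize([0.5, 1.5, 2.5])
second = element_at(buf, 1, 3)  # reads bytes 4..7 only
```

With variable-length elements the offset of element N depends on the lengths
of all earlier elements, which is the in-order walk mentioned above.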

> However, this syntax raises the question of why arrays cannot be created for 
> other types

Left this comment in the other thread, why not?  This could be useful outside 
the float use case, so having a new "VectorType(AbstractType elements, int 
size)” is easier/better than a float only version.  I also did a lot of work to 
fuzz test our type system, so just adding that into the existing generator 
would get good coverage right off the bat (have another fuzz tester I have not 
contributed yet, it was done for Accord… it fuzz tests the AST, so would be 
easy to add this there, that would test type specific access, which the 
existing tests don’t)

> Finally, the original qualifier of 32 in `FLOAT32` was intended to allow 
> consistency if we add other float types like FLOAT16 or FLOAT64

I do not think we should add a new FLOAT32 type, but I am cool with an alias 
that has FLOAT32 point to FLOAT.  One negative of this is that the code paths 
where we return schema back to users would do FLOAT even if user wrote FLOAT32… 
other than that negative I don’t see any other problems.
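
The alias trade-off can be sketched as follows. This is a toy illustration in
Python, not Cassandra code: ALIASES and resolve_type are hypothetical names,
and the point is only that once an alias is resolved at parse time the
original spelling is gone.

```python
# Hypothetical alias table: FLOAT32 is accepted on input but is not a
# distinct type; it resolves to the existing FLOAT.
ALIASES = {"FLOAT32": "FLOAT"}


def resolve_type(name: str) -> str:
    """Return the canonical type name that would be stored in the schema."""
    upper = name.upper()
    return ALIASES.get(upper, upper)


# The user wrote FLOAT32, but the schema only remembers FLOAT.
stored = resolve_type("float32")
```

Anything that later prints the schema sees only the stored name, which is why
the user would get FLOAT back even after writing FLOAT32.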

> Thus, we believe that `FLOAT VECTOR[N_DIMENSIONS]` provides the best balance 
> of clarity, conciseness, and extensibility. It is more natural in its word 
> order than the original proposal and avoids unnecessary qualifiers, while 
> still being clear about the data type it represents. Finally, this syntax is 
> straightforwardly extensible should we choose to support other vector types in 
> the future.

My preference is TYPE[n_dimension] but I am ok with this syntax if others 
prefer it.  I don’t agree that this extra verbosity adds more clarity, there 
seems to be an assumption that this will tell users that random access isn’t 
allowed and only blessed types are allowed… both points I feel are not valid 
(or not seen anything published why they should be valid).  There is a 
difference between what a type “could” do and what we implement day 1, I 
wouldn’t want to add more verbosity because of intentions of the day 1 
implementation. 


> On Apr 26, 2023, at 7:31 AM, Jonathan Ellis  wrote:
> 
> Hi all,
> 
> Splitting this out per the suggestion in the initial VS thread so we can work 
> on driver support in parallel with the 

Re: [DISCUSS] New data type for vector search

2023-04-26 Thread Benedict Elliott Smith
I think we need to briefly step back and think about what the syntax means and 
how it fits into existing syntax.

It seems that the dimensionality verbiage assumes we’re logically introducing N 
vector fields, so that each row adopts a value for all of the vector fields or 
none. But in practice we are actually introducing a fixed-length frozen list in 
Cassandra terms, and our API treats this as a per-row array/vector rather than 
a number of column vectors.

My inclination then would be to say you declare an ARRAY (which is semantic 
sugar for FROZEN>). This is very consistent with our existing style. We then 
simply permit such columns to define ANN indexes.

Otherwise, I think we should lean into the idea that this is a set of N 
vectors, as “dimensions” makes limited sense when discussing an array length. 
In this case I would lean towards declaring e.g. 1500 FLOAT VECTORS, maybe. But 
then I think we should reconsider our presentation a little, and perhaps the 
result set should treat each vector as a separate field (or something like 
this).

On 26 Apr 2023, at 15:31, Jonathan Ellis  wrote:

> Hi all,
> 
> Splitting this out per the suggestion in the initial VS thread so we can work 
> on driver support in parallel with the server-side changes.
> 
> I propose adding a new data type for vector search indexes:
> 
> FLOAT VECTOR[N_DIMENSIONS]
> 
> In the initial commits and thread, this was DENSE FLOAT32. Nobody really 
> loved that, so we considered a bunch of alternatives, including
> 
> - `FLOAT[N]`: This minimal option resembles C and Java array syntax, which 
> would make it familiar for many users. However, this syntax raises the 
> question of why arrays cannot be created for other types.  Additionally, the 
> expectation for an array is to provide random access to its contents, which 
> is not supported for vectors.
> - `DENSE FLOAT[N]`: This option clarifies that we are supporting dense 
> vectors, not sparse ones. However, since Lucene had sparse vector support in 
> the past but removed it for lack of compelling use cases, it is unlikely that 
> it will be added back, making the "DENSE" qualifier less relevant.
> - `DENSE FLOAT VECTOR[N]`: This is the most verbose option and aligns with 
> the CQL/SQL spirit. However, the "DENSE" qualifier is unnecessary for the 
> reasons mentioned above.
> - `VECTOR FLOAT[N]`: This option omits the "DENSE" qualifier, but has a less 
> natural word order.
> - `VECTOR`: This follows the syntax of our Collections, but again this would 
> imply that random access is supported, which we want to avoid doing.
> - `VECTOR[N]`: This syntax is not very clear about the vector's contents and 
> could make it difficult to add other vector types, such as byte vectors 
> (already supported by Lucene), in the future.
> 
> Finally, the original qualifier of 32 in `FLOAT32` was intended to allow 
> consistency if we add other float types like FLOAT16 or FLOAT64, both of 
> which are sometimes used in ML. However, we already have a CQL data type for 
> a 64-bit float (`DOUBLE`), so it would make more sense to add future variants 
> (which remain hypothetical at this point) along that line instead.
> 
> Thus, we believe that `FLOAT VECTOR[N_DIMENSIONS]` provides the best balance 
> of clarity, conciseness, and extensibility. It is more natural in its word 
> order than the original proposal and avoids unnecessary qualifiers, while 
> still being clear about the data type it represents. Finally, this syntax is 
> straightforwardly extensible should we choose to support other vector types 
> in the future.
> 
> -- 
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced