> I think a simple and easy case can be made for fixed-length array types that 
> do not leave random bits of cruft dangling in the language should this play 
> not pan out. 

If I am understanding you correctly, then a "VECTOR FLOAT[n]" is fine as it's an 
array type with two new properties: fixed length and NON NULL elements… for 
optimization reasons we can always have a special composeForFloat(ByteBuffer) 
-> float[] which can be used by ANN (avoids boxing and java.util.List).

This makes sure the type doesn't care about ML and can be used for other use 
cases, but any ML systems that come after the fact can special-case this 
type… this is basically the change I made locally to test out the effort.
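
To make that concrete, here is a minimal sketch of what such a helper could 
look like. The class name, the explicit dimension parameter, and the 
assumption that elements are packed as plain IEEE-754 floats back to back are 
illustrative only, not the actual patch:

import java.nio.ByteBuffer;

// Illustrative sketch (not the actual patch): decode a fixed-length float
// vector straight into a float[] so ANN code never touches boxed Floats or
// java.util.List.
public final class VectorCodec
{
    private VectorCodec() {}

    // Assumes the value is 'dimension' IEEE-754 floats laid out back to back.
    public static float[] composeAsFloats(ByteBuffer bytes, int dimension)
    {
        ByteBuffer view = bytes.duplicate(); // leave the caller's position untouched
        float[] out = new float[dimension];
        for (int i = 0; i < dimension; i++)
            out[i] = view.getFloat();
        return out;
    }

    // Inverse direction, mostly useful in tests: pack a float[] into a ByteBuffer.
    public static ByteBuffer decomposeFloats(float[] values)
    {
        ByteBuffer out = ByteBuffer.allocate(values.length * Float.BYTES);
        for (float v : values)
            out.putFloat(v);
        out.flip();
        return out;
    }
}

With something like this, the ANN call site only changes by one line: instead 
of deserializing to a List<Float>, it asks the type for a float[] directly.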

> On May 1, 2023, at 1:17 PM, Benedict <bened...@apache.org> wrote:
> 
> I have explained repeatedly why I am opposed to ML-specific data types. If we 
> want to make an ML-specific data type, it should be in an ML plug-in. We 
> should not pollute the general-purpose language with hastily considered 
> features that target specific bandwagons - at best partially - no matter how 
> exciting the bandwagon.
> 
> I think a simple and easy case can be made for fixed-length array types that 
> do not leave random bits of cruft dangling in the language should this play 
> not pan out. This is an easy way for this effort to make progress without 
> negatively impacting the language.
> 
> That is, unless we want to start supporting totally random types for every 
> use case at the top-level language layer. I don't think this is a good idea, 
> personally, and I'm quite confident we would now be regretting this approach 
> had it been taken for earlier bandwagons.
> 
> Nor do I think anyone's priors about how successful this effort will be 
> should matter. As a matter of principle, we should simply never deliver 
> specialist functionality as a high-level CQL language feature without at 
> least baking it for several years as a plug-in.
> 
>> On 1 May 2023, at 21:03, Mick Semb Wever <m...@apache.org> wrote:
>> 
>> 
>> 
>> Yes!  What you (David) and Benedict write beautifully supports `VECTOR 
>> FLOAT[n]` imho.
>> 
>> You are definitely bringing up valid implementation details, and those can be 
>> dealt with during patch review. This thread is about the CQL API addition.
>> 
>> No matter which way the technical review goes with the implementation 
>> details, `VECTOR FLOAT[n]` does not limit it, and it gives us the most 
>> ML-idiomatic approach and the best long-term CQL API. It's a win-win 
>> situation: no matter how you look at it, imho it is the best solution API-wise.
>> 
>> Unless the suggestion is that an ideal implementation can give us a better 
>> CQL API, but I don't see what that could be. Maybe the suggestion is that we 
>> deny the possibility of using the VECTOR keyword and bring us back to 
>> something like `NON-NULL FROZEN<FLOAT[n]>`. This is odd to me because 
>> `VECTOR` here can be just an alias for `NON-NULL FROZEN` while meeting the 
>> patch's audience and their idioms. I have no problem with introducing such 
>> an alias to meet the ML crowd.
>> 
>> Another way I think of this:
>>  `VECTOR FLOAT[n]` is the porcelain ML CQL API;
>>  `NON-NULL FROZEN<FLOAT[n]>`, `FROZEN<FLOAT[n]>`, and `FLOAT[n]` are the 
>> general-use plumbing CQL APIs.
>> 
>> This would allow implementation details to be moved out of this thread and 
>> to the review phase.
>> 
>> 
>> 
>> 
>> On Mon, 1 May 2023 at 20:57, David Capwell <dcapw...@apple.com> wrote:
>>> > I think it is totally reasonable that the ANN patch (and Jonathan) is not 
>>> > asked to implement on top of, or towards, other array (or other) new data 
>>> > types.
>>> 
>>> 
>>> This impacts serialization; if you do not think about this on day one, you 
>>> can't add it later without having to worry about migration and versioning… 
>>> 
>>> Honestly, I wanted to better understand the cost of being generic and the 
>>> impact on ANN, so I took 
>>> https://github.com/jbellis/cassandra/blob/vsearch/src/java/org/apache/cassandra/db/marshal/VectorType.java
>>> and made it handle every requirement I have listed so far (size, null, all 
>>> types)… the current patch has several bugs at the type level that would 
>>> need to be fixed, so I had to fix those as well… Total time to do this was 
>>> 10 minutes… and this includes adding a method "public float[] 
>>> composeAsFloats(ByteBuffer bytes)", which made the change to existing logic 
>>> small (change VectorType.Serializer.instance.deserialize(buffer) to 
>>> type.composeAsFloats(buffer)).
>>> 
>>> Did this have any impact on the final ByteBuffer? Nope, the layout is 
>>> identical for the FloatType case, but it works for all types… I didn't 
>>> change the fact that we store the size (it felt like this could be removed, 
>>> but then we could never support expanding the vector in the future…)
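>>> 
>>> In other words, the framing is just "stored size + fixed-width elements". A 
>>> rough sketch of that layout for the float case (the class and method names 
>>> here are illustrative, not the actual VectorType code):
>>> 
>>> import java.nio.ByteBuffer;
>>> 
>>> // Illustrative only: a vector value framed as "element count + elements".
>>> // Keeping the stored count is what leaves room to expand a vector later.
>>> public final class VectorLayoutSketch
>>> {
>>>     private VectorLayoutSketch() {}
>>> 
>>>     public static ByteBuffer serialize(float[] values)
>>>     {
>>>         ByteBuffer out = ByteBuffer.allocate(Integer.BYTES + values.length * Float.BYTES);
>>>         out.putInt(values.length);      // stored size
>>>         for (float v : values)
>>>             out.putFloat(v);            // fixed-width elements, back to back
>>>         out.flip();
>>>         return out;
>>>     }
>>> 
>>>     public static float[] deserialize(ByteBuffer bytes)
>>>     {
>>>         ByteBuffer view = bytes.duplicate();
>>>         int count = view.getInt();
>>>         float[] out = new float[count];
>>>         for (int i = 0; i < count; i++)
>>>             out[i] = view.getFloat();
>>>         return out;
>>>     }
>>> }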
>>> 
>>> So, given that it takes a few minutes to implement all these requirements, 
>>> I find it very reasonable to push back and say we should make sure the new 
>>> type is not leaking details from a special ANN index… We have spent more 
>>> time debating this than it takes to support it… we also have fuzz testing 
>>> on trunk, so just updating 
>>> org.apache.cassandra.utils.AbstractTypeGenerators to know about this new 
>>> type means we get type coverage as well…
>>> 
>>> I have zero issues helping to review this patch and making sure the testing 
>>> is on par with existing types (this is a strong requirement for me).
>>> 
>>> 
>>> > On May 1, 2023, at 10:40 AM, Mick Semb Wever <m...@apache.org> wrote:
>>> > 
>>> > 
>>> > > But suggesting that Jonathan should work on implementing general 
>>> > > purpose arrays seems to fall outside the scope of this discussion, 
>>> > > since the result of such work wouldn't even fill the need Jonathan is 
>>> > > targeting here. 
>>> > 
>>> > In every comment I have made so far, I have argued that the v1 work 
>>> > doesn't need to do some things, but that the limitations proposed so far 
>>> > are not real requirements; there is a big difference between what "could 
>>> > be allowed" and what is implemented day one… I am pushing back on what 
>>> > "could be allowed"; so far, every justification has been that it slows 
>>> > down the ANN work…
>>> > 
>>> > Simple examples of this already exist in C* (every example could be 
>>> > enhanced logically, we just have yet to put in the work):
>>> > 
>>> > * updating an element of a list is only allowed for multi-cell
>>> > * appending to a list is only allowed for multi-cell
>>> > * etc.
>>> > 
>>> > By saying that the type "shall not support" something, you actively block 
>>> > future work and future possibilities...
>>> > 
>>> > 
>>> > 
>>> > I am coming around strongly to the `VECTOR FLOAT[n]` option.
>>> > 
>>> > This gives Jonathan the simplest path right now with the ANN work, while 
>>> > also ensuring the CQL API gets the best future potential.
>>> > 
>>> > With `VECTOR FLOAT[n]`, the 'vector' is the ML sugar that means non-null 
>>> > and frozen, and it allows, both today and in the future, for its 
>>> > implementation to be entirely different from `FLOAT[n]`, as desired. This 
>>> > addresses a number of people's concerns that we should meet ML's idioms 
>>> > head on.
>>> > 
>>> > IMHO it feels like it will fit into the ideal future CQL, where all 
>>> > `primitive[N]` are implemented, and where we have VECTOR FLOAT[n] (and 
>>> > maybe VECTOR BYTE[n]). This would also permit `FROZEN<primitive[n]>` in 
>>> > the future if we wanted nulls in frozen arrays.
>>> > 
>>> > I think it is totally reasonable that the ANN patch (and Jonathan) is not 
>>> > asked to implement on top of, or towards, other array (or other) new data 
>>> > types.
>>> > 
>>> > I also think it is correct that we think about the evolution of CQL's 
>>> > API, and how it might exist in the future when we have both ML vectors 
>>> > and general-use arrays.
>>> 
