Re: [DISCUSS] New data type for vector search

Mick Semb Wever Tue, 02 May 2023 00:08:42 -0700

I have no problem with `VECTOR` hanging around forever as an alias for
`NON-NULL FROZEN`.  Even without ANN, it makes sense and will stick with
new C* users.


A plug-in system would be great, but it shouldn't hold back this work imho.



On Mon, 1 May 2023 at 22:17, Benedict <bened...@apache.org> wrote:

> I have explained repeatedly why I am opposed to ML-specific data types. If
> we want to make an ML-specific data type, it should be in an ML plug-in. We
> should not pollute the general purpose language with hastily-considered
> features that target specific bandwagons - at best partially - no matter
> how exciting the bandwagon.
>
> I think a simple and easy case can be made for fixed length array types
> that do not seem to create random bits of cruft in the language that dangle
> by themselves should this play not pan out. This is an easy way for this
> effort to make progress without negatively impacting the language.
>
> That is, unless we want to start supporting totally random types for every
> use case at the top level language layer. I don’t think this is a good
> idea, personally, and I’m quite confident we would now be regretting this
> approach had it been taken for earlier bandwagons.
>
> Nor do I think anyone’s priors about how successful this effort will be
> should matter. As a matter of principle, we should simply never deliver a
> specialist functionality as a high level CQL language feature without at
> least baking it for several years as a plug-in.
>
> On 1 May 2023, at 21:03, Mick Semb Wever <m...@apache.org> wrote:
>
> 
>
> Yes!  What you (David) and Benedict write beautifully supports `VECTOR
> FLOAT[n]` imho.
>
> You are definitely bringing up valid implementation details, and that can
> be dealt with during patch review. This thread is about the CQL API
> addition.
>
> No matter which way the technical review goes with the implementation
> details, `VECTOR FLOAT[n]` does not limit it, and gives us the most ML
> idiomatic approach and the best long-term CQL API.  It's a win-win
> situation – no matter how you look at it imho it is the best solution api
> wise.
>
> Unless the suggestion is that an ideal implementation can give us a better
> CQL API – but I don't see what that could be.   Maybe the suggestion is we
> deny the possibility of using the VECTOR keyword and bring us back to
> something like `NON-NULL FROZEN<FLOAT[n]>`.   This is odd to me because
> `VECTOR` here can be just an alias for `NON-NULL FROZEN` while meeting the
> patch's audience and their idioms.  I have no problems with introducing
> such an alias to meet the ML crowd.
>
> Another way I think of this is
>  `VECTOR FLOAT[n]` is the porcelain ML cql api,
>  `NON-NULL FROZEN<FLOAT[n]>` and `FROZEN<FLOAT[n]>` and `FLOAT[n]` are the
> general-use plumbing cql apis.
>
> This would allow implementation details to be moved out of this thread and
> to the review phase.
>
>
>
>
> On Mon, 1 May 2023 at 20:57, David Capwell <dcapw...@apple.com> wrote:
>
>> > I think it is totally reasonable that the ANN patch (and Jonathan) is
>> not asked to implement on top of, or towards, other array (or other) new
>> data types.
>>
>>
>> This impacts serialization, if you do not think about this day 1 you then
>> can’t add later on without having to worry about migration and versioning…
>>
>> Honestly I wanted to better understand the cost to be generic and the
>> impact to ANN, so I took
>> https://github.com/jbellis/cassandra/blob/vsearch/src/java/org/apache/cassandra/db/marshal/VectorType.java
>> and made it handle every requirement I have listed so far (size, null, all
>> types)… the current patch has several bugs at the type level that would
>> need to be fixed, so had to fix those as well…. Total time to do this was
>> 10 minutes… and this includes adding a method "public float[]
>> composeAsFloats(ByteBuffer bytes)” which made the change to existing logic
>> small (change VectorType.Serializer.instance.deserialize(buffer) to
>> type.composeAsFloats(buffer))….
>>
>> Did this have any impact to the final ByteBuffer?  Nope, it had identical
>> layout for the FloatType case, but works for all types…. I didn’t change
>> the fact we store the size (felt this could be removed, but then we could
>> never support expanding the vector in the future…)
>>
>> So, given the fact it takes a few minutes to implement all these
>> requirements, I do find it very reasonable to push back and say we should
>> make sure the new type is not leaking details from a special ANN index…. We
>> have spent more time debating this than it takes to support… we also have
>> fuzz testing on trunk so just updating
>> org.apache.cassandra.utils.AbstractTypeGenerators to know about this new
>> type means we get type coverage as well…
>>
>> I have zero issues helping to review this patch and make sure the testing
>> is on-par with existing types (this is a strong requirement for me)
>>
>>
>> > On May 1, 2023, at 10:40 AM, Mick Semb Wever <m...@apache.org> wrote:
>> >
>> >
>> > > But suggesting that Jonathan should work on implementing general
>> purpose arrays seems to fall outside the scope of this discussion, since
>> the result of such work wouldn't even fill the need Jonathan is targeting
>> for here.
>> >
>> > Every comment I have made so far I have argued that the v1 work doesn’t
>> need to do some things, but that the limitations proposed so far are not
>> real requirements; there is a big difference between what “could be
>> allowed” and what is implemented day one… I am pushing back on what “could
>> be allowed”, so far every justification has been that it slows down the ANN
>> work…
>> >
>> > Simple examples of this already exists in C* (every example could be
>> enhanced logically, we just have yet to put in the work)
>> >
>> > * updating a element of a list is only allowed for multi-cell
>> > * appending to a list is only allowed for multi-cell
>> > * etc.
>> >
>> > By saying that the type "shall not support", you actively block future
>> work and future possibilities...
>> >
>> >
>> >
>> > I am coming around strongly to the `VECTOR FLOAT[n]` option.
>> >
>> > This gives Jonathan the simplest path right now with ths ANN work,
>> while also ensuring the CQL API gets the best future potential.
>> >
>> > With `VECTOR FLOAT[n]` the 'vector' is the ml sugar that means non-null
>> and frozen, and that allows both today and in the future, as desired, for
>> its implementation to be entirely different to `FLOAT[n]`.  This addresses
>> a number of people's concerns that we meet ML's idioms head on.
>> >
>> > IMHO it feels like it will fit into the ideal future CQL , where all
>> `primitive[N]` are implemented, and where we have VECTOR FLOAT[n] (and
>> maybe VECTOR BYTE[n]). This will also permit in the future
>> `FROZEN<primitive[n]>` if we wanted nulls in frozen arrays.
>> >
>> > I think it is totally reasonable that the ANN patch (and Jonathan) is
>> not asked to implement on top of, or towards, other array (or other) new
>> data types.
>> >
>> > I also think it is correct that we think about the evolution of CQL's
>> API,  and how it might exist in the future when we have both ml vectors and
>> general use arrays.
>>
>>

Re: [DISCUSS] New data type for vector search

Reply via email to