Re: [DISCUSS] New data type for vector search

Mick Semb Wever Mon, 01 May 2023 13:03:09 -0700

Yes!  What you (David) and Benedict write beautifully supports `VECTOR
FLOAT[n]` imho.


You are definitely bringing up valid implementation details, and that can
be dealt with during patch review. This thread is about the CQL API
addition.

No matter which way the technical review goes with the implementation
details, `VECTOR FLOAT[n]` does not limit it, and gives us the most ML
idiomatic approach and the best long-term CQL API.  It's a win-win
situation – no matter how you look at it imho it is the best solution api
wise.

Unless the suggestion is that an ideal implementation can give us a better
CQL API – but I don't see what that could be.   Maybe the suggestion is we
deny the possibility of using the VECTOR keyword and bring us back to
something like `NON-NULL FROZEN<FLOAT[n]>`.   This is odd to me because
`VECTOR` here can be just an alias for `NON-NULL FROZEN` while meeting the
patch's audience and their idioms.  I have no problems with introducing
such an alias to meet the ML crowd.

Another way I think of this is
 `VECTOR FLOAT[n]` is the porcelain ML cql api,
 `NON-NULL FROZEN<FLOAT[n]>` and `FROZEN<FLOAT[n]>` and `FLOAT[n]` are the
general-use plumbing cql apis.

This would allow implementation details to be moved out of this thread and
to the review phase.




On Mon, 1 May 2023 at 20:57, David Capwell <[email protected]> wrote:

> > I think it is totally reasonable that the ANN patch (and Jonathan) is
> not asked to implement on top of, or towards, other array (or other) new
> data types.
>
>
> This impacts serialization, if you do not think about this day 1 you then
> can’t add later on without having to worry about migration and versioning…
>
> Honestly I wanted to better understand the cost to be generic and the
> impact to ANN, so I took
> https://github.com/jbellis/cassandra/blob/vsearch/src/java/org/apache/cassandra/db/marshal/VectorType.java
> and made it handle every requirement I have listed so far (size, null, all
> types)… the current patch has several bugs at the type level that would
> need to be fixed, so had to fix those as well…. Total time to do this was
> 10 minutes… and this includes adding a method "public float[]
> composeAsFloats(ByteBuffer bytes)” which made the change to existing logic
> small (change VectorType.Serializer.instance.deserialize(buffer) to
> type.composeAsFloats(buffer))….
>
> Did this have any impact to the final ByteBuffer?  Nope, it had identical
> layout for the FloatType case, but works for all types…. I didn’t change
> the fact we store the size (felt this could be removed, but then we could
> never support expanding the vector in the future…)
>
> So, given the fact it takes a few minutes to implement all these
> requirements, I do find it very reasonable to push back and say we should
> make sure the new type is not leaking details from a special ANN index…. We
> have spent more time debating this than it takes to support… we also have
> fuzz testing on trunk so just updating
> org.apache.cassandra.utils.AbstractTypeGenerators to know about this new
> type means we get type coverage as well…
>
> I have zero issues helping to review this patch and make sure the testing
> is on-par with existing types (this is a strong requirement for me)
>
>
> > On May 1, 2023, at 10:40 AM, Mick Semb Wever <[email protected]> wrote:
> >
> >
> > > But suggesting that Jonathan should work on implementing general
> purpose arrays seems to fall outside the scope of this discussion, since
> the result of such work wouldn't even fill the need Jonathan is targeting
> for here.
> >
> > Every comment I have made so far I have argued that the v1 work doesn’t
> need to do some things, but that the limitations proposed so far are not
> real requirements; there is a big difference between what “could be
> allowed” and what is implemented day one… I am pushing back on what “could
> be allowed”, so far every justification has been that it slows down the ANN
> work…
> >
> > Simple examples of this already exists in C* (every example could be
> enhanced logically, we just have yet to put in the work)
> >
> > * updating a element of a list is only allowed for multi-cell
> > * appending to a list is only allowed for multi-cell
> > * etc.
> >
> > By saying that the type "shall not support", you actively block future
> work and future possibilities...
> >
> >
> >
> > I am coming around strongly to the `VECTOR FLOAT[n]` option.
> >
> > This gives Jonathan the simplest path right now with ths ANN work, while
> also ensuring the CQL API gets the best future potential.
> >
> > With `VECTOR FLOAT[n]` the 'vector' is the ml sugar that means non-null
> and frozen, and that allows both today and in the future, as desired, for
> its implementation to be entirely different to `FLOAT[n]`.  This addresses
> a number of people's concerns that we meet ML's idioms head on.
> >
> > IMHO it feels like it will fit into the ideal future CQL , where all
> `primitive[N]` are implemented, and where we have VECTOR FLOAT[n] (and
> maybe VECTOR BYTE[n]). This will also permit in the future
> `FROZEN<primitive[n]>` if we wanted nulls in frozen arrays.
> >
> > I think it is totally reasonable that the ANN patch (and Jonathan) is
> not asked to implement on top of, or towards, other array (or other) new
> data types.
> >
> > I also think it is correct that we think about the evolution of CQL's
> API,  and how it might exist in the future when we have both ml vectors and
> general use arrays.
>
>

Re: [DISCUSS] New data type for vector search

Reply via email to