Re: [DISCUSS] New data type for vector search

J. D. Jordan Mon, 01 May 2023 16:38:39 -0700

Yes. Plugging in a new type server side is very easy. Adding that type to every client is not.

Cassandra already supports plugging in custom types through a jar. What a given client does when encountering a custom type it doesn’t know about depends on the client.

I was recently looking at this for DynamicCompositeType, which is a type shipped in C* but exposed through the custom type machinery. Very few drivers implement support for it. I saw one driver just crash upon encountering it in the schema at startup. One crashed if you included such a column in a query. And one threw warnings and treated it as a binary blob.

So as David said, the client side is per driver. Also I would recommend thinking about using the existing custom type stuff if possible so that we don’t have to roll the native protocol version to add new type enums and even though unknown customs types act different in each driver, they do mostly allow someone to plug-in an implementation for them.

On May 1, 2023, at 5:12 PM, David Capwell <dcapw...@apple.com> wrote:

A data type plug-in is actually really easy today, I think?

Sadly not, the client reads the class from our schema tables and has to have duplicate logic to serialize/deserialize results… types are easy to add if you are ok with client not understanding them (and will some clients fail due to every language having its own logic?)

On May 1, 2023, at 2:26 PM, Benedict <bened...@apache.org> wrote:

A data type plug-in is actually really easy today, I think? But, developing further hooks should probably be thought through as they’re necessary.

I think in this case it would be simpler to deliver a general purpose type, which is why I’m trying to propose types that would be acceptable.

I also think we’re pretty close to agreement, really?

But if not, let’s flesh out potential plug-in requirements.

On 1 May 2023, at 21:58, Josh McKenzie <jmcken...@apache.org> wrote:

If we want to make an ML-specific data type, it should be in an ML plug-in.
How can we encourage a healthier plug-in ecosystem? As far as I know it's been pretty anemic historically:

cassandra: https://cassandra.apache.org/doc/latest/cassandra/plugins/index.html
postgres: https://www.postgresql.org/docs/current/contrib.html

I'm really interested to hear if there's more in the ecosystem I'm not aware of or if there's been strides made in this regard; users in the ecosystem being able to write durable extensions to Cassandra that they can then distribute and gain momentum could potentially be a great incubator for new features or functionality in the ecosystem.

If our support for extensions remains as bare as I believe it to be, I wouldn't recommend anyone go that route.

On Mon, May 1, 2023, at 4:17 PM, Benedict wrote:

I have explained repeatedly why I am opposed to ML-specific data types. If we want to make an ML-specific data type, it should be in an ML plug-in. We should not pollute the general purpose language with hastily-considered features that target specific bandwagons - at best partially - no matter how exciting the bandwagon.

I think a simple and easy case can be made for fixed length array types that do not seem to create random bits of cruft in the language that dangle by themselves should this play not pan out. This is an easy way for this effort to make progress without negatively impacting the language.

That is, unless we want to start supporting totally random types for every use case at the top level language layer. I don’t think this is a good idea, personally, and I’m quite confident we would now be regretting this approach had it been taken for earlier bandwagons.

Nor do I think anyone’s priors about how successful this effort will be should matter. As a matter of principle, we should simply never deliver a specialist functionality as a high level CQL language feature without at least baking it for several years as a plug-in.

On 1 May 2023, at 21:03, Mick Semb Wever <m...@apache.org> wrote:

Yes! What you (David) and Benedict write beautifully supports `VECTOR FLOAT[n]` imho.

You are definitely bringing up valid implementation details, and that can be dealt with during patch review. This thread is about the CQL API addition.

No matter which way the technical review goes with the implementation details, `VECTOR FLOAT[n]` does not limit it, and gives us the most ML idiomatic approach and the best long-term CQL API. It's a win-win situation – no matter how you look at it imho it is the best solution api wise.

Unless the suggestion is that an ideal implementation can give us a better CQL API – but I don't see what that could be. Maybe the suggestion is we deny the possibility of using the VECTOR keyword and bring us back to something like `NON-NULL FROZEN<FLOAT[n]>`. This is odd to me because `VECTOR` here can be just an alias for `NON-NULL FROZEN` while meeting the patch's audience and their idioms. I have no problems with introducing such an alias to meet the ML crowd.

Another way I think of this is
`VECTOR FLOAT[n]` is the porcelain ML cql api,
`NON-NULL FROZEN<FLOAT[n]>` and `FROZEN<FLOAT[n]>` and `FLOAT[n]` are the general-use plumbing cql apis.

This would allow implementation details to be moved out of this thread and to the review phase.

On Mon, 1 May 2023 at 20:57, David Capwell <dcapw...@apple.com> wrote:
> I think it is totally reasonable that the ANN patch (and Jonathan) is not asked to implement on top of, or towards, other array (or other) new data types.

This impacts serialization, if you do not think about this day 1 you then can’t add later on without having to worry about migration and versioning…

Honestly I wanted to better understand the cost to be generic and the impact to ANN, so I took https://github.com/jbellis/cassandra/blob/vsearch/src/java/org/apache/cassandra/db/marshal/VectorType.java and made it handle every requirement I have listed so far (size, null, all types)… the current patch has several bugs at the type level that would need to be fixed, so had to fix those as well…. Total time to do this was 10 minutes… and this includes adding a method "public float[] composeAsFloats(ByteBuffer bytes)” which made the change to existing logic small (change VectorType.Serializer.instance.deserialize(buffer) to type.composeAsFloats(buffer))….

Did this have any impact to the final ByteBuffer? Nope, it had identical layout for the FloatType case, but works for all types…. I didn’t change the fact we store the size (felt this could be removed, but then we could never support expanding the vector in the future…)

So, given the fact it takes a few minutes to implement all these requirements, I do find it very reasonable to push back and say we should make sure the new type is not leaking details from a special ANN index…. We have spent more time debating this than it takes to support… we also have fuzz testing on trunk so just updating org.apache.cassandra.utils.AbstractTypeGenerators to know about this new type means we get type coverage as well…

I have zero issues helping to review this patch and make sure the testing is on-par with existing types (this is a strong requirement for me)

> On May 1, 2023, at 10:40 AM, Mick Semb Wever <m...@apache.org> wrote:
>
>
> > But suggesting that Jonathan should work on implementing general purpose arrays seems to fall outside the scope of this discussion, since the result of such work wouldn't even fill the need Jonathan is targeting for here.
>
> Every comment I have made so far I have argued that the v1 work doesn’t need to do some things, but that the limitations proposed so far are not real requirements; there is a big difference between what “could be allowed” and what is implemented day one… I am pushing back on what “could be allowed”, so far every justification has been that it slows down the ANN work…
>
> Simple examples of this already exists in C* (every example could be enhanced logically, we just have yet to put in the work)
>
> * updating a element of a list is only allowed for multi-cell
> * appending to a list is only allowed for multi-cell
> * etc.
>
> By saying that the type "shall not support", you actively block future work and future possibilities...
>
>
>
> I am coming around strongly to the `VECTOR FLOAT[n]` option.
>
> This gives Jonathan the simplest path right now with ths ANN work, while also ensuring the CQL API gets the best future potential.
>
> With `VECTOR FLOAT[n]` the 'vector' is the ml sugar that means non-null and frozen, and that allows both today and in the future, as desired, for its implementation to be entirely different to `FLOAT[n]`. This addresses a number of people's concerns that we meet ML's idioms head on.
>
> IMHO it feels like it will fit into the ideal future CQL , where all `primitive[N]` are implemented, and where we have VECTOR FLOAT[n] (and maybe VECTOR BYTE[n]). This will also permit in the future `FROZEN<primitive[n]>` if we wanted nulls in frozen arrays.
>
> I think it is totally reasonable that the ANN patch (and Jonathan) is not asked to implement on top of, or towards, other array (or other) new data types.
>
> I also think it is correct that we think about the evolution of CQL's API, and how it might exist in the future when we have both ml vectors and general use arrays.

Re: [DISCUSS] New data type for vector search

Reply via email to