I have no problem with `VECTOR` hanging around forever as an alias for `NON-NULL FROZEN`. Even without ANN, it makes sense and will stick with new C* users.
A plug-in system would be great, but it shouldn't hold back this work imho. On Mon, 1 May 2023 at 22:17, Benedict <bened...@apache.org> wrote: > I have explained repeatedly why I am opposed to ML-specific data types. If > we want to make an ML-specific data type, it should be in an ML plug-in. We > should not pollute the general purpose language with hastily-considered > features that target specific bandwagons - at best partially - no matter > how exciting the bandwagon. > > I think a simple and easy case can be made for fixed length array types > that do not seem to create random bits of cruft in the language that dangle > by themselves should this play not pan out. This is an easy way for this > effort to make progress without negatively impacting the language. > > That is, unless we want to start supporting totally random types for every > use case at the top level language layer. I don’t think this is a good > idea, personally, and I’m quite confident we would now be regretting this > approach had it been taken for earlier bandwagons. > > Nor do I think anyone’s priors about how successful this effort will be > should matter. As a matter of principle, we should simply never deliver a > specialist functionality as a high level CQL language feature without at > least baking it for several years as a plug-in. > > On 1 May 2023, at 21:03, Mick Semb Wever <m...@apache.org> wrote: > > > > Yes! What you (David) and Benedict write beautifully supports `VECTOR > FLOAT[n]` imho. > > You are definitely bringing up valid implementation details, and that can > be dealt with during patch review. This thread is about the CQL API > addition. > > No matter which way the technical review goes with the implementation > details, `VECTOR FLOAT[n]` does not limit it, and gives us the most ML > idiomatic approach and the best long-term CQL API. It's a win-win > situation – no matter how you look at it imho it is the best solution api > wise. > > Unless the suggestion is that an ideal implementation can give us a better > CQL API – but I don't see what that could be. Maybe the suggestion is we > deny the possibility of using the VECTOR keyword and bring us back to > something like `NON-NULL FROZEN<FLOAT[n]>`. This is odd to me because > `VECTOR` here can be just an alias for `NON-NULL FROZEN` while meeting the > patch's audience and their idioms. I have no problems with introducing > such an alias to meet the ML crowd. > > Another way I think of this is > `VECTOR FLOAT[n]` is the porcelain ML cql api, > `NON-NULL FROZEN<FLOAT[n]>` and `FROZEN<FLOAT[n]>` and `FLOAT[n]` are the > general-use plumbing cql apis. > > This would allow implementation details to be moved out of this thread and > to the review phase. > > > > > On Mon, 1 May 2023 at 20:57, David Capwell <dcapw...@apple.com> wrote: > >> > I think it is totally reasonable that the ANN patch (and Jonathan) is >> not asked to implement on top of, or towards, other array (or other) new >> data types. >> >> >> This impacts serialization, if you do not think about this day 1 you then >> can’t add later on without having to worry about migration and versioning… >> >> Honestly I wanted to better understand the cost to be generic and the >> impact to ANN, so I took >> https://github.com/jbellis/cassandra/blob/vsearch/src/java/org/apache/cassandra/db/marshal/VectorType.java >> and made it handle every requirement I have listed so far (size, null, all >> types)… the current patch has several bugs at the type level that would >> need to be fixed, so had to fix those as well…. Total time to do this was >> 10 minutes… and this includes adding a method "public float[] >> composeAsFloats(ByteBuffer bytes)” which made the change to existing logic >> small (change VectorType.Serializer.instance.deserialize(buffer) to >> type.composeAsFloats(buffer))…. >> >> Did this have any impact to the final ByteBuffer? Nope, it had identical >> layout for the FloatType case, but works for all types…. I didn’t change >> the fact we store the size (felt this could be removed, but then we could >> never support expanding the vector in the future…) >> >> So, given the fact it takes a few minutes to implement all these >> requirements, I do find it very reasonable to push back and say we should >> make sure the new type is not leaking details from a special ANN index…. We >> have spent more time debating this than it takes to support… we also have >> fuzz testing on trunk so just updating >> org.apache.cassandra.utils.AbstractTypeGenerators to know about this new >> type means we get type coverage as well… >> >> I have zero issues helping to review this patch and make sure the testing >> is on-par with existing types (this is a strong requirement for me) >> >> >> > On May 1, 2023, at 10:40 AM, Mick Semb Wever <m...@apache.org> wrote: >> > >> > >> > > But suggesting that Jonathan should work on implementing general >> purpose arrays seems to fall outside the scope of this discussion, since >> the result of such work wouldn't even fill the need Jonathan is targeting >> for here. >> > >> > Every comment I have made so far I have argued that the v1 work doesn’t >> need to do some things, but that the limitations proposed so far are not >> real requirements; there is a big difference between what “could be >> allowed” and what is implemented day one… I am pushing back on what “could >> be allowed”, so far every justification has been that it slows down the ANN >> work… >> > >> > Simple examples of this already exists in C* (every example could be >> enhanced logically, we just have yet to put in the work) >> > >> > * updating a element of a list is only allowed for multi-cell >> > * appending to a list is only allowed for multi-cell >> > * etc. >> > >> > By saying that the type "shall not support", you actively block future >> work and future possibilities... >> > >> > >> > >> > I am coming around strongly to the `VECTOR FLOAT[n]` option. >> > >> > This gives Jonathan the simplest path right now with ths ANN work, >> while also ensuring the CQL API gets the best future potential. >> > >> > With `VECTOR FLOAT[n]` the 'vector' is the ml sugar that means non-null >> and frozen, and that allows both today and in the future, as desired, for >> its implementation to be entirely different to `FLOAT[n]`. This addresses >> a number of people's concerns that we meet ML's idioms head on. >> > >> > IMHO it feels like it will fit into the ideal future CQL , where all >> `primitive[N]` are implemented, and where we have VECTOR FLOAT[n] (and >> maybe VECTOR BYTE[n]). This will also permit in the future >> `FROZEN<primitive[n]>` if we wanted nulls in frozen arrays. >> > >> > I think it is totally reasonable that the ANN patch (and Jonathan) is >> not asked to implement on top of, or towards, other array (or other) new >> data types. >> > >> > I also think it is correct that we think about the evolution of CQL's >> API, and how it might exist in the future when we have both ml vectors and >> general use arrays. >> >>