Re: [DISCUSS] New data type for vector search

Benedict Mon, 01 May 2023 11:38:59 -0700

Has anybody yet claimed it would be hard? Several folk seem ready to jump to 
the conclusion that this would be onerous, but as somebody with a good 
understanding of the storage layer I can assert with reasonable confidence that 
it would not be. As previously stated, the implementation largely already 
exists for frozen lists.


If we are going to let difficulty of implementation inform our CQL evolution, 
my view is that the bar for additional difficulty should be high: CQL 
changes need to be well considered because they are not easily revisited, and 
bad decisions survive indefinitely. The alternative, as David points out, is a 
plug-in system.

So, maybe let’s wait until somebody makes a specific and serious claim of how 
challenging it would be, with justification, before we jump to compromising our 
language evolution based on it. I’m not even sure yet that this is really a 
consideration by anyone involved.

> On 1 May 2023, at 18:41, Mick Semb Wever <m...@apache.org> wrote:
> 
> 
>> 
>> > But suggesting that Jonathan should work on implementing general purpose 
>> > arrays seems to fall outside the scope of this discussion, since the 
>> > result of such work wouldn't even fill the need Jonathan is targeting for 
>> > here. 
>> 
>> In every comment I have made so far I have argued that the v1 work doesn’t need 
>> to do some things, but that the limitations proposed so far are not real 
>> requirements; there is a big difference between what “could be allowed” and 
>> what is implemented day one… I am pushing back on what “could be allowed”, 
>> so far every justification has been that it slows down the ANN work…
>> 
>> Simple examples of this already exist in C* (every example could be 
>> enhanced logically, we just have yet to put in the work)
>> 
>> * updating an element of a list is only allowed for multi-cell
>> * appending to a list is only allowed for multi-cell
>> * etc.
>> 
>> By saying that the type "shall not support", you actively block future work 
>> and future possibilities...
> 
> 
> 
> I am coming around strongly to the `VECTOR FLOAT[n]` option.
> 
> This gives Jonathan the simplest path right now with the ANN work, while also 
> ensuring the CQL API gets the best future potential.
> 
> With `VECTOR FLOAT[n]` the 'vector' is the ML sugar that means non-null and 
> frozen, and that allows both today and in the future, as desired, for its 
> implementation to be entirely different to `FLOAT[n]`.  This addresses a 
> number of people's concerns that we meet ML's idioms head on.
> 
> IMHO it feels like it will fit into the ideal future CQL, where all 
> `primitive[N]` are implemented, and where we have VECTOR FLOAT[n] (and maybe 
> VECTOR BYTE[n]). This will also permit in the future `FROZEN<primitive[n]>` 
> if we wanted nulls in frozen arrays.
> 
> I think it is totally reasonable that the ANN patch (and Jonathan) is not 
> asked to implement on top of, or towards, other array (or other) new data 
> types.
> 
> I also think it is correct that we think about the evolution of CQL's API, 
> and how it might exist in the future when we have both ML vectors and general 
> use arrays.
