Re: [POLL] Vector type for ML

Caleb Rackliffe Thu, 04 May 2023 19:02:14 -0700

Even in the ML case, sparse can just mean zeros rather than nulls, and they
should compress similarly anyway.


If we really want null values, I'd rather leave that in collections space.
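
The point above about zeros standing in for nulls can be sketched in a few lines of Python. This is only an illustration under assumptions: `encode_dense` and `densify` are hypothetical helper names, the big-endian float32 layout is an assumed encoding, and zlib is a stand-in compressor. None of it reflects Cassandra's actual serialization or compression.

```python
import struct
import zlib


def encode_dense(values, dimension):
    """Pack a fixed-length, non-null float vector as big-endian float32 bytes.

    Hypothetical helper: the layout is an assumption for illustration only.
    """
    if len(values) != dimension:
        raise ValueError(f"expected {dimension} elements, got {len(values)}")
    if any(v is None for v in values):
        raise ValueError("dense vectors cannot contain null elements")
    return struct.pack(f">{dimension}f", *values)


def densify(sparse, dimension):
    """Expand an {index: value} mapping into a dense list, using 0.0 for gaps."""
    return [float(sparse.get(i, 0.0)) for i in range(dimension)]


# A "sparse" vector with two populated positions out of 128.
sparse = {3: 1.5, 40: -2.0}
dense = densify(sparse, 128)

payload = encode_dense(dense, 128)   # 128 elements * 4 bytes each
compressed = zlib.compress(payload)  # long runs of zero bytes shrink well
```

Here the mostly-zero payload compresses to a small fraction of its raw size, which is the intuition behind letting "sparse" mean zeros in a non-null vector type and leaving true nulls to collections.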

On Thu, May 4, 2023 at 8:59 PM Caleb Rackliffe <[email protected]>
wrote:

> I actually still prefer *type[dimension]*, because I think I intuitively
> read this as a primitive (meaning no null elements) array. Then we can have
> the indexing apparatus only accept *frozen<float[n]>* for the HNSW case.
>
> If that isn't intuitive to anyone else, I don't really have a strong
> opinion...but...conflating "frozen" and "dense" seems like a bad idea. One
> should indicate single vs. multi-cell, and the other the presence or
> absence of nulls/zeros/whatever.
>
> On Thu, May 4, 2023 at 12:51 PM Patrick McFadin <[email protected]>
> wrote:
>
>> I agree with David's reasoning and the use of DENSE (and maybe eventually
>> SPARSE). This is terminology well established in the data world, and it
>> would lead to much easier adoption from users. VECTOR is close, but I can
>> see having to create a lot of content around "How to use it and not get in
>> trouble." (I have a lot of that content already)
>>
>>  - We don't have to explain what it is. A lot of prior art out there
>> already [1][2][3]
>>  - We're matching an established term with what users would expect. No
>> surprises.
>>  - Shorter ramp-up time for users. Cassandra is being modernized.
>>
>> The implementation is flexible, but the interface should empower our
>> users to be awesome.
>>
>> Patrick
>>
>> 1 -
>> https://stats.stackexchange.com/questions/266996/what-do-the-terms-dense-and-sparse-mean-in-the-context-of-neural-networks
>> 2 -
>> https://induraj2020.medium.com/what-are-sparse-features-and-dense-features-8d1746a77035
>> 3 -
>> https://revware.net/sparse-vs-dense-data-the-power-of-points-and-clouds/
>>
>> On Thu, May 4, 2023 at 10:25 AM David Capwell <[email protected]> wrote:
>>
>>> My views have changed over time on syntax and I feel type[dimension] may
>>> not be the best, so it has gone lower in my own personal ranking… this is
>>> my current preference
>>>
>>> 1) DENSE <type>[dimension] | NON NULL <type>[dimension]
>>> 2) VECTOR<type, dimension>
>>> 3) type[dimension]
>>>
>>> My reasoning for this order
>>>
>>> * type[dimension] looks like syntax sugar for array<type, dimension>, so
>>> users may assume list/array semantics, but we limit to non-null elements in
>>> a frozen array
>>> * VECTOR as a prefix feels out of place, but VECTOR as a direct
>>> type makes more sense… this also leads to a possible future of VECTOR<type>
>>> which is the non-fixed length version of this type.  What makes VECTOR
>>> different from list/array?  non-null elements and is frozen.  I don’t feel
>>> that VECTOR really tells users to expect non-null or frozen semantics, as
>>> there exist different VECTOR types for those reasons (sparse vs dense)…
>>> * DENSE may be confusing for people coming from languages where this
>>> just means “sequential layout”, which is what our frozen array/list already
>>> are… but since the target user is coming from an ML background, this
>>> shouldn’t offer much confusion.  DENSE just means FROZEN in Cassandra, with
>>> NON NULL elements (SPARSE allows for NULL and isn’t frozen)… So DENSE just
>>> acts as syntax sugar for frozen<non null type[dimension]>
>>>
>>>
>>> On May 4, 2023, at 4:13 AM, Brandon Williams <[email protected]> wrote:
>>>
>>> 1. VECTOR<FLOAT,n>
>>> 2. VECTOR FLOAT[n]
>>> 3. FLOAT[N]   (Non null by default)
>>>
>>> Redundant or not, I think having the VECTOR keyword helps signify what
>>> the app is generally about and helps get buy-in from ML stakeholders.
>>>
>>> On Thu, May 4, 2023 at 3:45 AM Benedict <[email protected]> wrote:
>>>
>>>
>>> Hurrah for initial agreement.
>>>
>>> For syntax, I think one option was just FLOAT[N]. In VECTOR FLOAT[N],
>>> VECTOR is redundant - FLOAT[N] is fully descriptive by itself. I don’t
>>> think VECTOR should be used to simply imply non-null, as this would be very
>>> unintuitive. More logical would be NONNULL, if this is the only condition
>>> being applied. Alternatively for arrays we could default to NONNULL and
>>> later introduce NULLABLE if we want to permit nulls.
>>>
>>> If the word vector is to be used it makes more sense to make it look
>>> like a list, so VECTOR<FLOAT, N> as here the word VECTOR is clearly not
>>> redundant.
>>>
>>> So, I vote:
>>>
>>> 1) (NON NULL) FLOAT[N]
>>> 2) FLOAT[N]   (Non null by default)
>>> 3) VECTOR<FLOAT, N>
>>>
>>>
>>>
>>> On 4 May 2023, at 08:52, Mick Semb Wever <[email protected]> wrote:
>>>
>>>
>>>
>>>
>>> Did we agree on a CQL syntax?
>>>
>>> I don’t believe there has been a poll on CQL syntax… my understanding
>>> reading all the threads is that there are ~4-5 options and none are -1ed, so
>>> I believe we are waiting for majority rule on this?
>>>
>>>
>>>
>>>
>>> Re-reading that thread, IIUC the valid choices remaining are…
>>>
>>> 1. VECTOR FLOAT[n]
>>> 2. FLOAT VECTOR[n]
>>> 3. VECTOR<FLOAT,n>
>>> 4. VECTOR[n]<FLOAT>
>>> 5. ARRAY<FLOAT, n>
>>> 6. NON-NULL FROZEN<FLOAT[n]>
>>>
>>>
>>> Yes I'm putting my preference (1) first ;) because (banging on) if the
>>> future of CQL will have FLOAT[n] and FROZEN<FLOAT[n]>, where the VECTOR
>>> keyword, for general CQL users, just means "non-null and frozen",
>>> these gel best together.
>>>
>>> Options (5) and (6) are for those that feel we can and should provide
>>> this type without introducing the vector keyword.
>>>
>>>
>>>
>>>
