Idiomatically, to my mind, there's a question of "what space are we thinking 
about this datatype in"?

- In the context of mathematics, nullability in a vector would be 0
- In the context of Cassandra, nullability tends to mean a tombstone (or 
nothing)
- In the context of programming languages, it's all over the place

Given that many models are exploring quantization to int8 and other data 
types, we also need to keep the "support other data types easily in the 
future" piece in mind.

So with the above and the "meet the user where they are and don't make them 
understand more of Cassandra than absolutely critical to use it", I lean:

1. DENSE_VECTOR<type, dimension>
2. VECTOR<type, dimension>
3. type[dimension]

This leaves the path open for us to expand on it in the future with sparse 
support and allows us to introduce some semantics that indicate idioms around 
nullability for the users coming from a different space.
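
To make the nullability point concrete, here is a minimal Python sketch of 
what "dense" could mean at write time: a fixed dimension and no null 
elements. The function name and error messages are hypothetical and purely 
illustrative; this is not Cassandra code.

```python
from typing import List, Optional, Sequence


def validate_dense_vector(values: Sequence[Optional[float]], dimension: int) -> List[float]:
    """Hypothetical check for a dense vector: exact length, no nulls."""
    if len(values) != dimension:
        raise ValueError(f"expected {dimension} elements, got {len(values)}")
    for i, v in enumerate(values):
        if v is None:
            # In the dense reading, "empty" is expressed as 0, never as null.
            raise ValueError(f"element {i} is null; dense vectors do not allow nulls")
    return [float(v) for v in values]
```

Under this reading a zero is an ordinary value, so [1.0, 0.0, 2.5] passes 
while [1.0, None, 2.5] is rejected.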

"NON-NULL FROZEN<TYPE[n]>" is strictly correct, however it requires 
understanding idioms of how Cassandra thinks about data (nulls mean different 
things to us, we have differences between frozen and non-frozen due to 
constraints in our storage engine and materialization of data, etc) that get in 
the way of users doing things in the pattern they're familiar with without 
learning more about the DB than they're probably looking to learn. Historically 
this has been a challenge for us in adoption; the classic "Why can't I just 
write and delete and write as much as I want? Why are deletes filling up my 
disk?" problem comes to mind.

I'd also be happy with us supporting:
* NON-NULL FROZEN<TYPE[n]>
* DENSE_VECTOR<type, dimension> as syntactic sugar for the above

If getting into the "built-in syntactic sugar mapping for communities and 
specific use-cases" is something we're willing to consider.
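
A minimal sketch of that sugar mapping in Python, assuming the two spellings 
above were both accepted. The desugar function and its regular expression 
are hypothetical, for illustration only; this is not how the CQL parser is 
implemented.

```python
import re

# Hypothetical pattern for the proposed DENSE_VECTOR<type, dimension> spelling.
_DENSE_VECTOR = re.compile(
    r"^\s*DENSE_VECTOR\s*<\s*(\w+)\s*,\s*(\d+)\s*>\s*$", re.IGNORECASE
)


def desugar(type_expr: str) -> str:
    """Rewrite DENSE_VECTOR<type, n> to NON-NULL FROZEN<TYPE[n]>.

    Any other type expression is returned unchanged.
    """
    match = _DENSE_VECTOR.match(type_expr)
    if match is None:
        return type_expr
    element_type = match.group(1).upper()
    dimension = int(match.group(2))
    if dimension <= 0:
        raise ValueError("dimension must be positive")
    return f"NON-NULL FROZEN<{element_type}[{dimension}]>"
```

With this mapping, desugar("DENSE_VECTOR<float, 3>") yields 
"NON-NULL FROZEN<FLOAT[3]>", so both spellings would describe the same type.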

On Fri, May 5, 2023, at 7:26 AM, Patrick McFadin wrote:
> I think we are still discussing implementation here when I'm talking about 
> developer experience. I want developers to adopt this quickly, easily and be 
> successful. Vector search is already a thing. People use it every day. A 
> successful outcome, in my view, is developers picking up this feature without 
> reading a manual. (Because they don't anyway and get in trouble) I did some 
> more extensive research about what other DBs are using for syntax. The 
> consensus is some variety of 'VECTOR', 'DENSE' and 'SPARSE'
> 
> Pinecone[1] - dense_vector, sparse_vector
> Elastic[2]: dense_vector
> Milvus[3]: float_vector, binary_vector
> pgvector[4]: vector
> Weaviate[5]: Different approach. All typed arrays can be indexed
> 
> Based on that I'm advocating a similar syntax:
> 
> - DENSE VECTOR
> or
> - VECTOR
> 
> [1] https://docs.pinecone.io/docs/hybrid-search
> [2] 
> https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html
> [3] https://milvus.io/docs/create_collection.md
> [4] https://github.com/pgvector/pgvector
> [5] https://weaviate.io/developers/weaviate/config-refs/datatypes
> 
> On Fri, May 5, 2023 at 6:07 AM Mike Adamson <madam...@datastax.com> wrote:
>>> Then we can have the indexing apparatus only accept *frozen<float[n]>* for 
>>> the HNSW case.
>> I'm inclined to agree with Benedict that the index will need to be 
>> specifically selected by option rather than inferred based on type. As such 
>> there is no real reason for the *frozen* requirement on the type. The HNSW 
>> index can be built just as easily from a non-frozen array.
>> 
>> I am in favour of enforcing non-null on the elements of an array by default. 
>> I would prefer that allowing nulls in the array would be a later addition if 
>> and when a use case arose for it.
>> 
>> On Fri, 5 May 2023 at 03:02, Caleb Rackliffe <calebrackli...@gmail.com> 
>> wrote:
>>> Even in the ML case, sparse can just mean zeros rather than nulls, and they 
>>> should compress similarly anyway.
>>> 
>>> If we really want null values, I'd rather leave that in collections space.
>>> 
>>> On Thu, May 4, 2023 at 8:59 PM Caleb Rackliffe <calebrackli...@gmail.com> 
>>> wrote:
>>>> I actually still prefer *type[dimension]*, because I think I intuitively 
>>>> read this as a primitive (meaning no null elements) array. Then we can 
>>>> have the indexing apparatus only accept *frozen<float[n]>* for the HNSW 
>>>> case.
>>>> 
>>>> If that isn't intuitive to anyone else, I don't really have a strong 
>>>> opinion...but...conflating "frozen" and "dense" seems like a bad idea. One 
>>>> should indicate single vs. multi-cell, and the other the presence or 
>>>> absence of nulls/zeros/whatever.
>>>> 
>>>> On Thu, May 4, 2023 at 12:51 PM Patrick McFadin <pmcfa...@gmail.com> wrote:
>>>>> I agree with David's reasoning and the use of DENSE (and maybe eventually 
>>>>> SPARSE). This is terminology well established in the data world, and it 
>>>>> would lead to much easier adoption from users. VECTOR is close, but I can 
>>>>> see having to create a lot of content around "How to use it and not get 
>>>>> in trouble." (I have a lot of that content already)
>>>>> 
>>>>>  - We don't have to explain what it is. A lot of prior art out there 
>>>>> already [1][2][3]
>>>>>  - We're matching an established term with what users would expect. No 
>>>>> surprises. 
>>>>>  - Shorter ramp-up time for users. Cassandra is being modernized.
>>>>> 
>>>>> The implementation is flexible, but the interface should empower our 
>>>>> users to be awesome. 
>>>>> 
>>>>> Patrick
>>>>> 
>>>>> 1 - 
>>>>> https://stats.stackexchange.com/questions/266996/what-do-the-terms-dense-and-sparse-mean-in-the-context-of-neural-networks
>>>>> 2 - 
>>>>> https://induraj2020.medium.com/what-are-sparse-features-and-dense-features-8d1746a77035
>>>>> 3 - 
>>>>> https://revware.net/sparse-vs-dense-data-the-power-of-points-and-clouds/ 
>>>>> 
>>>>> On Thu, May 4, 2023 at 10:25 AM David Capwell <dcapw...@apple.com> wrote:
>>>>>> My views have changed over time on syntax and I feel type[dimension] may 
>>>>>> not be the best, so it has gone lower in my own personal ranking… this 
>>>>>> is my current preference
>>>>>> 
>>>>>> 1) DENSE <type>[dimension] | NON NULL <type>[dimension]
>>>>>> 2) VECTOR<type, dimension>
>>>>>> 3) type[dimension]
>>>>>> 
>>>>>> My reasoning for this order
>>>>>> 
>>>>>> * type[dimension] looks like syntax sugar for array<type, dimension>, so 
>>>>>> users may assume list/array semantics, but we limit to non-null elements 
>>>>>> in a frozen array
>>>>>> * VECTOR as a prefix feels out of place, but VECTOR as a direct 
>>>>>> type makes more sense… this also leads to a possible future of 
>>>>>> VECTOR<type> which is the non-fixed length version of this type.  What 
>>>>>> makes VECTOR different from list/array?  non-null elements and is 
>>>>>> frozen.  I don’t feel that VECTOR really tells users to expect non-null 
>>>>>> or frozen semantics, as there exist different VECTOR types for those 
>>>>>> reasons (sparse vs dense)… 
>>>>>> * DENSE may be confusing for people coming from languages where this 
>>>>>> just means “sequential layout”, which is what our frozen array/list 
>>>>>> already are… but since the target user is coming from a ML background, 
>>>>>> this shouldn’t offer much confusion.  DENSE just means FROZEN in 
>>>>>> Cassandra, with NON NULL elements (SPARSE allows for NULL and isn’t 
>>>>>> frozen)… So DENSE just acts as syntax sugar for frozen<non null 
>>>>>> type[dimension]>
>>>>>> 
>>>>>> 
>>>>>>> On May 4, 2023, at 4:13 AM, Brandon Williams <dri...@gmail.com> wrote:
>>>>>>> 
>>>>>>> 1. VECTOR<FLOAT,n>
>>>>>>> 2. VECTOR FLOAT[n]
>>>>>>> 3. FLOAT[N]   (Non null by default)
>>>>>>> 
>>>>>>> Redundant or not, I think having the VECTOR keyword helps signify what
>>>>>>> the app is generally about and helps get buy-in from ML stakeholders.
>>>>>>> 
>>>>>>> On Thu, May 4, 2023 at 3:45 AM Benedict <bened...@apache.org> wrote:
>>>>>>>> 
>>>>>>>> Hurrah for initial agreement.
>>>>>>>> 
>>>>>>>> For syntax, I think one option was just FLOAT[N]. In VECTOR FLOAT[N], 
>>>>>>>> VECTOR is redundant - FLOAT[N] is fully descriptive by itself. I don’t 
>>>>>>>> think VECTOR should be used to simply imply non-null, as this would be 
>>>>>>>> very unintuitive. More logical would be NONNULL, if this is the only 
>>>>>>>> condition being applied. Alternatively for arrays we could default to 
>>>>>>>> NONNULL and later introduce NULLABLE if we want to permit nulls.
>>>>>>>> 
>>>>>>>> If the word vector is to be used it makes more sense to make it look 
>>>>>>>> like a list, so VECTOR<FLOAT, N> as here the word VECTOR is clearly 
>>>>>>>> not redundant.
>>>>>>>> 
>>>>>>>> So, I vote:
>>>>>>>> 
>>>>>>>> 1) (NON NULL) FLOAT[N]
>>>>>>>> 2) FLOAT[N]   (Non null by default)
>>>>>>>> 3) VECTOR<FLOAT, N>
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On 4 May 2023, at 08:52, Mick Semb Wever <m...@apache.org> wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Did we agree on a CQL syntax?
>>>>>>>>> 
>>>>>>>>> I don’t believe there has been a poll on CQL syntax… my understanding 
>>>>>>>>> reading all the threads is that there are ~4-5 options and none are 
>>>>>>>>> -1ed, so I believe we are waiting for majority rule on this?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Re-reading that thread, IIUC the valid choices remaining are…
>>>>>>>> 
>>>>>>>> 1. VECTOR FLOAT[n]
>>>>>>>> 2. FLOAT VECTOR[n]
>>>>>>>> 3. VECTOR<FLOAT,n>
>>>>>>>> 4. VECTOR[n]<FLOAT>
>>>>>>>> 5. ARRAY<FLOAT, n>
>>>>>>>> 6. NON-NULL FROZEN<FLOAT[n]>
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Yes I'm putting my preference (1) first ;) because (banging on) if the 
>>>>>>>> future of CQL will have FLOAT[n] and FROZEN<FLOAT[n]>, where the 
>>>>>>>> VECTOR keyword is: for general cql users; just meaning "non-null and 
>>>>>>>> frozen", these gel best together.
>>>>>>>> 
>>>>>>>> Options (5) and (6) are for those that feel we can and should provide 
>>>>>>>> this type without introducing the vector keyword.
>>>>>>>> 
>>>>>>>> 
>>>>>> 
>> 
>> 
>> --
>> *Mike Adamson*
>> Engineering
>> DataStax <https://www.datastax.com/>
