LOL, I'm holding you to that at the summit :) In all seriousness, I'm glad
to see a robust debate around it. I guess for completeness, my order of
preference is

1 - NONNULL FROZEN<TYPE<N>>
2 - NONNULL TYPE<N> (which part of this implies frozen? The NONNULL or the
cardinality?)
3 - DENSE_VECTOR<type, N>

I guess my main concern with just "VECTOR" is that it's such an overloaded
term. Maybe in ML it means something specific, but for anyone coming from
C++, Rust, Java, etc, a Vector is both mutable and can carry null (or
equivalent, e.g. None, in Rust). If the argument hadn't also been made that
we should be working toward something that's not ML-specific, maybe I would
be less concerned.

Cheers,

Derek



On Fri, May 5, 2023 at 11:14 AM Patrick McFadin <pmcfa...@gmail.com> wrote:

> Derek, despite your preference, I would hang out with you at a party.
>
> On Fri, May 5, 2023 at 9:44 AM Derek Chen-Becker <de...@chen-becker.org>
> wrote:
>
>> Speaking as someone who likes Erlang, maybe that's why I also like
>> NONNULL FROZEN<TYPE<[n]>>. It's unambiguous what Cassandra is going to do
>> with that type. DENSE VECTOR means I need to go read docs (and then
>> probably double-check in the source to be sure) to be sure what exactly is
>> going on.
>>
>> Cheers,
>>
>> Derek
>>
>> On Fri, May 5, 2023 at 9:54 AM Patrick McFadin <pmcfa...@gmail.com>
>> wrote:
>>
>>> I hope we are willing to consider developers that use our system because
>>> if I had to teach people to use "NON-NULL FROZEN<TYPE[n]>" I'm pretty sure
>>> the response would be:
>>>
>>> Did you tell me to go write a distributed map-reduce job in Erlang? I
>>> believe I did, Bob.
>>>
>>> On Fri, May 5, 2023 at 8:05 AM Josh McKenzie <jmcken...@apache.org>
>>> wrote:
>>>
>>>> Idiomatically, to my mind, there's a question of "what space are we
>>>> thinking about this datatype in"?
>>>>
>>>> - In the context of mathematics, nullability in a vector would be 0
>>>> - In the context of Cassandra, nullability tends to mean a tombstone
>>>> (or nothing)
>>>> - In the context of programming languages, it's all over the place
>>>>
>>>> Given many models are exploring quantizing to int8 and other data
>>>> types, there's definitely the "support other data types easily in the
>>>> future" piece to me we need to keep in mind.
>>>>
>>>> So with the above and the "meet the user where they are and don't make
>>>> them understand more of Cassandra than absolutely critical to use it", I
>>>> lean:
>>>>
>>>> 1. DENSE_VECTOR<type, dimension>
>>>> 2. VECTOR<type, dimension>
>>>> 3. type[dimension]
>>>>
>>>> This leaves the path open for us to expand on it in the future with
>>>> sparse support and allows us to introduce some semantics that indicate
>>>> idioms around nullability for the users coming from a different space.
>>>>
>>>> "NON-NULL FROZEN<TYPE[n]>" is strictly correct, however it requires
>>>> understanding idioms of how Cassandra thinks about data (nulls mean
>>>> different things to us, we have differences between frozen and non-frozen
>>>> due to constraints in our storage engine and materialization of data, etc)
>>>> that get in the way of users doing things in the pattern they're familiar
>>>> with without learning more about the DB than they're probably looking to
>>>> learn. Historically this has been a challenge for us in adoption; the
>>>> classic "Why can't I just write and delete and write as much as I want? Why
>>>> are deletes filling up my disk?" problem comes to mind.
>>>>
>>>> I'd also be happy with us supporting:
>>>> * NON-NULL FROZEN<TYPE[n]>
>>>> * DENSE_VECTOR<type, dimension> as syntactic sugar for the above
>>>>
>>>> If getting into the "built-in syntactic sugar mapping for communities
>>>> and specific use-cases" is something we're willing to consider.
>>>>
>>>> On Fri, May 5, 2023, at 7:26 AM, Patrick McFadin wrote:
>>>>
>>>> I think we are still discussing implementation here when I'm talking
>>>> about developer experience. I want developers to adopt this quickly, easily
>>>> and be successful. Vector search is already a thing. People use it every
>>>> day. A successful outcome, in my view, is developers picking up this
>>>> feature without reading a manual. (Because they don't anyway and get in
>>>> trouble) I did some more extensive research about what other DBs are using
>>>> for syntax. The consensus is some variety of 'VECTOR', 'DENSE' and 'SPARSE'.
>>>>
>>>> Pinecone[1] - dense_vector, sparse_vector
>>>> Elastic[2]: dense_vector
>>>> Milvus[3]: float_vector, binary_vector
>>>> pgvector[4]: vector
>>>> Weaviate[5]: Different approach. All typed arrays can be indexed
>>>>
>>>> Based on that I'm advocating a similar syntax:
>>>>
>>>> - DENSE VECTOR
>>>> or
>>>> - VECTOR
>>>>
>>>> [1] https://docs.pinecone.io/docs/hybrid-search
>>>> [2]
>>>> https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html
>>>> [3] https://milvus.io/docs/create_collection.md
>>>> [4] https://github.com/pgvector/pgvector
>>>> [5] https://weaviate.io/developers/weaviate/config-refs/datatypes
>>>>
>>>> On Fri, May 5, 2023 at 6:07 AM Mike Adamson <madam...@datastax.com>
>>>> wrote:
>>>>
>>>> Then we can have the indexing apparatus only accept *frozen<float[n]>* for
>>>> the HNSW case.
>>>>
>>>> I'm inclined to agree with Benedict that the index will need to be
>>>> specifically selected by option rather than inferred based on type. As such
>>>> there is no real reason for the *frozen* requirement on the type. The
>>>> hnsw index can be built just as easily from a non-frozen array.
>>>>
>>>> I am in favour of enforcing non-null on the elements of an array by
>>>> default. I would prefer that allowing nulls in the array would be a later
>>>> addition if and when a use case arose for it.
>>>>
>>>> On Fri, 5 May 2023 at 03:02, Caleb Rackliffe <calebrackli...@gmail.com>
>>>> wrote:
>>>>
>>>> Even in the ML case, sparse can just mean zeros rather than nulls, and
>>>> they should compress similarly anyway.
>>>>
>>>> If we really want null values, I'd rather leave that in collections
>>>> space.
>>>>
>>>> On Thu, May 4, 2023 at 8:59 PM Caleb Rackliffe <
>>>> calebrackli...@gmail.com> wrote:
>>>>
>>>> I actually still prefer *type[dimension]*, because I think I
>>>> intuitively read this as a primitive (meaning no null elements) array. Then
>>>> we can have the indexing apparatus only accept *frozen<float[n]>* for
>>>> the HNSW case.
>>>>
>>>> If that isn't intuitive to anyone else, I don't really have a strong
>>>> opinion...but...conflating "frozen" and "dense" seems like a bad idea. One
>>>> should indicate single vs. multi-cell, and the other the presence or
>>>> absence of nulls/zeros/whatever.
>>>>
>>>> On Thu, May 4, 2023 at 12:51 PM Patrick McFadin <pmcfa...@gmail.com>
>>>> wrote:
>>>>
>>>> I agree with David's reasoning and the use of DENSE (and maybe
>>>> eventually SPARSE). This is terminology well established in the data world,
>>>> and it would lead to much easier adoption from users. VECTOR is close, but
>>>> I can see having to create a lot of content around "How to use it and not
>>>> get in trouble." (I have a lot of that content already)
>>>>
>>>>  - We don't have to explain what it is. A lot of prior art out there
>>>> already [1][2][3]
>>>>  - We're matching an established term with what users would expect. No
>>>> surprises.
>>>>  - Shorter ramp-up time for users. Cassandra is being modernized.
>>>>
>>>> The implementation is flexible, but the interface should empower our
>>>> users to be awesome.
>>>>
>>>> Patrick
>>>>
>>>> 1 -
>>>> https://stats.stackexchange.com/questions/266996/what-do-the-terms-dense-and-sparse-mean-in-the-context-of-neural-networks
>>>> 2 -
>>>> https://induraj2020.medium.com/what-are-sparse-features-and-dense-features-8d1746a77035
>>>> 3 -
>>>> https://revware.net/sparse-vs-dense-data-the-power-of-points-and-clouds/
>>>>
>>>> On Thu, May 4, 2023 at 10:25 AM David Capwell <dcapw...@apple.com>
>>>> wrote:
>>>>
>>>> My views on syntax have changed over time, and I feel type[dimension]
>>>> may not be the best, so it has gone lower in my own personal ranking. This
>>>> is my current preference:
>>>>
>>>> 1) DENSE <type>[dimension] | NON NULL <type>[dimension]
>>>> 2) VECTOR<type, dimension>
>>>> 3) type[dimension]
>>>>
>>>> My reasoning for this order
>>>>
>>>> * type[dimension] looks like syntax sugar for array<type, dimension>,
>>>> so users may assume list/array semantics, but we limit to non-null elements
>>>> in a frozen array
>>>> * VECTOR as a prefix feels out of place, but VECTOR as a direct
>>>> type makes more sense… this also leads to a possible future of VECTOR<type>,
>>>> which is the non-fixed-length version of this type.  What makes VECTOR
>>>> different from list/array?  It has non-null elements and is frozen.  I don’t feel
>>>> that VECTOR really tells users to expect non-null or frozen semantics, as
>>>> there exist different VECTOR types for those reasons (sparse vs dense)…
>>>> * DENSE may be confusing for people coming from languages where this
>>>> just means “sequential layout”, which is what our frozen array/list already
>>>> are… but since the target user is coming from an ML background, this
>>>> shouldn’t cause much confusion.  DENSE just means FROZEN in Cassandra, with
>>>> NON NULL elements (SPARSE allows for NULL and isn’t frozen)… So DENSE just
>>>> acts as syntax sugar for frozen<non null type[dimension]>
>>>>
>>>>
>>>> On May 4, 2023, at 4:13 AM, Brandon Williams <dri...@gmail.com> wrote:
>>>>
>>>> 1. VECTOR<FLOAT,n>
>>>> 2. VECTOR FLOAT[n]
>>>> 3. FLOAT[N]   (Non null by default)
>>>>
>>>> Redundant or not, I think having the VECTOR keyword helps signify what
>>>> the app is generally about and helps get buy-in from ML stakeholders.
>>>>
>>>> On Thu, May 4, 2023 at 3:45 AM Benedict <bened...@apache.org> wrote:
>>>>
>>>>
>>>> Hurrah for initial agreement.
>>>>
>>>> For syntax, I think one option was just FLOAT[N]. In VECTOR FLOAT[N],
>>>> VECTOR is redundant - FLOAT[N] is fully descriptive by itself. I don’t
>>>> think VECTOR should be used to simply imply non-null, as this would be very
>>>> unintuitive. More logical would be NONNULL, if this is the only condition
>>>> being applied. Alternatively for arrays we could default to NONNULL and
>>>> later introduce NULLABLE if we want to permit nulls.
>>>>
>>>> If the word vector is to be used it makes more sense to make it look
>>>> like a list, so VECTOR<FLOAT, N> as here the word VECTOR is clearly not
>>>> redundant.
>>>>
>>>> So, I vote:
>>>>
>>>> 1) (NON NULL) FLOAT[N]
>>>> 2) FLOAT[N]   (Non null by default)
>>>> 3) VECTOR<FLOAT, N>
>>>>
>>>>
>>>>
>>>> On 4 May 2023, at 08:52, Mick Semb Wever <m...@apache.org> wrote:
>>>>
>>>>
>>>>
>>>> Did we agree on a CQL syntax?
>>>>
>>>> I don’t believe there has been a poll on CQL syntax… my understanding
>>>> reading all the threads is that there are ~4-5 options and none are -1ed, so
>>>> I believe we are waiting for majority rule on this?
>>>>
>>>>
>>>>
>>>>
>>>> Re-reading that thread, IIUC the valid choices remaining are…
>>>>
>>>> 1. VECTOR FLOAT[n]
>>>> 2. FLOAT VECTOR[n]
>>>> 3. VECTOR<FLOAT,n>
>>>> 4. VECTOR[n]<FLOAT>
>>>> 5. ARRAY<FLOAT, n>
>>>> 6. NON-NULL FROZEN<FLOAT[n]>
>>>>
>>>>
>>>> Yes I'm putting my preference (1) first ;) because (banging on) if the
>>>> future of CQL will have FLOAT[n] and FROZEN<FLOAT[n]>, where the VECTOR
>>>> keyword, for general CQL users, just means "non-null and frozen",
>>>> these gel best together.
>>>>
>>>> Options (5) and (6) are for those that feel we can and should provide
>>>> this type without introducing the vector keyword.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> *Mike Adamson*
>>>> Engineering
>>>> +1 650 389 6000 | datastax.com
>>>>
>>>>
>>>>
>>
>> --
>> +---------------------------------------------------------------+
>> | Derek Chen-Becker                                             |
>> | GPG Key available at https://keybase.io/dchenbecker and       |
>> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
>> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
>> +---------------------------------------------------------------+
>>
>>

