Speaking as someone who likes Erlang, maybe that's why I also like NONNULL FROZEN<TYPE[n]>. It's unambiguous what Cassandra is going to do with that type. DENSE VECTOR means I need to go read the docs (and then probably double-check in the source) to be sure what exactly is going on.
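To make that concrete, here is a rough Python sketch of the three guarantees the spelled-out type name advertises: a fixed length n, no null elements, and frozen (the value is replaced as a whole, never edited in place). The class name `FrozenNonNullArray` and its fields are hypothetical, made up purely for illustration; this is not how Cassandra implements anything, and none of the CQL spellings in this thread are settled.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FrozenNonNullArray:
    """Hypothetical stand-in for NONNULL FROZEN<FLOAT[n]>; illustration only."""

    n: int                     # declared dimension, fixed by the column type
    values: Tuple[float, ...]  # held as one immutable value ("frozen")

    def __post_init__(self) -> None:
        # Copy into a tuple so the caller cannot mutate the contents later.
        object.__setattr__(self, "values", tuple(self.values))
        if len(self.values) != self.n:
            raise ValueError(f"expected {self.n} elements, got {len(self.values)}")
        if any(v is None for v in self.values):
            raise ValueError("null elements are not allowed")


embedding = FrozenNonNullArray(3, [0.1, 0.2, 0.3])

# An update writes a whole new value; there is no per-element edit.
updated = FrozenNonNullArray(3, (0.1, 0.2, 0.9))
```

In this sketch, passing a `None` element or the wrong number of elements raises `ValueError`, and assigning to `embedding.values` fails because the dataclass is frozen.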
Cheers,

Derek

On Fri, May 5, 2023 at 9:54 AM Patrick McFadin <pmcfa...@gmail.com> wrote:

> I hope we are willing to consider developers that use our system, because if I had to teach people to use "NON-NULL FROZEN<TYPE[n]>" I'm pretty sure the response would be:
>
> Did you tell me to go write a distributed map-reduce job in Erlang? I believe I did, Bob.
>
> On Fri, May 5, 2023 at 8:05 AM Josh McKenzie <jmcken...@apache.org> wrote:
>
>> Idiomatically, to my mind, there's a question of "what space are we thinking about this datatype in"?
>>
>> - In the context of mathematics, nullability in a vector would be 0
>> - In the context of Cassandra, nullability tends to mean a tombstone (or nothing)
>> - In the context of programming languages, it's all over the place
>>
>> Given many models are exploring quantizing to int8 and other data types, there's definitely the "support other data types easily in the future" piece that, to me, we need to keep in mind.
>>
>> So with the above and the "meet the user where they are and don't make them understand more of Cassandra than absolutely critical to use it", I lean:
>>
>> 1. DENSE_VECTOR<type, dimension>
>> 2. VECTOR<type, dimension>
>> 3. type[dimension]
>>
>> This leaves the path open for us to expand on it in the future with sparse support and allows us to introduce some semantics that indicate idioms around nullability for the users coming from a different space.
>>
>> "NON-NULL FROZEN<TYPE[n]>" is strictly correct; however, it requires understanding idioms of how Cassandra thinks about data (nulls mean different things to us, we have differences between frozen and non-frozen due to constraints in our storage engine and materialization of data, etc.) that get in the way of users doing things in the pattern they're familiar with without learning more about the DB than they're probably looking to learn.
>> Historically this has been a challenge for us in adoption; the classic "Why can't I just write and delete and write as much as I want? Why are deletes filling up my disk?" problem comes to mind.
>>
>> I'd also be happy with us supporting:
>> * NON-NULL FROZEN<TYPE[n]>
>> * DENSE_VECTOR<type, dimension> as syntactic sugar for the above
>>
>> If getting into the "built-in syntactic sugar mapping for communities and specific use-cases" is something we're willing to consider.
>>
>> On Fri, May 5, 2023, at 7:26 AM, Patrick McFadin wrote:
>>
>> I think we are still discussing implementation here when I'm talking about developer experience. I want developers to adopt this quickly, easily and be successful. Vector search is already a thing. People use it every day. A successful outcome, in my view, is developers picking up this feature without reading a manual. (Because they don't anyway and get in trouble.) I did some more extensive research about what other DBs are using for syntax. The consensus is some variety of 'VECTOR', 'DENSE' and 'SPARSE':
>>
>> Pinecone[1]: dense_vector, sparse_vector
>> Elastic[2]: dense_vector
>> Milvus[3]: float_vector, binary_vector
>> pgvector[4]: vector
>> Weaviate[5]: Different approach. All typed arrays can be indexed
>>
>> Based on that I'm advocating a similar syntax:
>>
>> - DENSE VECTOR
>> or
>> - VECTOR
>>
>> [1] https://docs.pinecone.io/docs/hybrid-search
>> [2] https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html
>> [3] https://milvus.io/docs/create_collection.md
>> [4] https://github.com/pgvector/pgvector
>> [5] https://weaviate.io/developers/weaviate/config-refs/datatypes
>>
>> On Fri, May 5, 2023 at 6:07 AM Mike Adamson <madam...@datastax.com> wrote:
>>
>> Then we can have the indexing apparatus only accept *frozen<float[n]>* for the HNSW case.
>>
>> I'm inclined to agree with Benedict that the index will need to be specifically selected by option rather than inferred based on type. As such there is no real reason for the *frozen* requirement on the type. The HNSW index can be built just as easily from a non-frozen array.
>>
>> I am in favour of enforcing non-null on the elements of an array by default. I would prefer that allowing nulls in the array would be a later addition if and when a use case arose for it.
>>
>> On Fri, 5 May 2023 at 03:02, Caleb Rackliffe <calebrackli...@gmail.com> wrote:
>>
>> Even in the ML case, sparse can just mean zeros rather than nulls, and they should compress similarly anyway.
>>
>> If we really want null values, I'd rather leave that in collections space.
>>
>> On Thu, May 4, 2023 at 8:59 PM Caleb Rackliffe <calebrackli...@gmail.com> wrote:
>>
>> I actually still prefer *type[dimension]*, because I think I intuitively read this as a primitive (meaning no null elements) array. Then we can have the indexing apparatus only accept *frozen<float[n]>* for the HNSW case.
>>
>> If that isn't intuitive to anyone else, I don't really have a strong opinion...but...conflating "frozen" and "dense" seems like a bad idea. One should indicate single vs. multi-cell, and the other the presence or absence of nulls/zeros/whatever.
>>
>> On Thu, May 4, 2023 at 12:51 PM Patrick McFadin <pmcfa...@gmail.com> wrote:
>>
>> I agree with David's reasoning and the use of DENSE (and maybe eventually SPARSE). This is terminology well established in the data world, and it would lead to much easier adoption from users. VECTOR is close, but I can see having to create a lot of content around "How to use it and not get in trouble." (I have a lot of that content already.)
>>
>> - We don't have to explain what it is. A lot of prior art out there already [1][2][3]
>> - We're matching an established term with what users would expect. No surprises.
>> - Shorter ramp-up time for users. Cassandra is being modernized.
>>
>> The implementation is flexible, but the interface should empower our users to be awesome.
>>
>> Patrick
>>
>> 1 - https://stats.stackexchange.com/questions/266996/what-do-the-terms-dense-and-sparse-mean-in-the-context-of-neural-networks
>> 2 - https://induraj2020.medium.com/what-are-sparse-features-and-dense-features-8d1746a77035
>> 3 - https://revware.net/sparse-vs-dense-data-the-power-of-points-and-clouds/
>>
>> On Thu, May 4, 2023 at 10:25 AM David Capwell <dcapw...@apple.com> wrote:
>>
>> My views have changed over time on syntax and I feel type[dimension] may not be the best, so it has gone lower in my own personal ranking… this is my current preference:
>>
>> 1) DENSE <type>[dimension] | NON NULL <type>[dimension]
>> 2) VECTOR<type, dimension>
>> 3) type[dimension]
>>
>> My reasoning for this order:
>>
>> * type[dimension] looks like syntax sugar for array<type, dimension>, so users may assume list/array semantics, but we limit to non-null elements in a frozen array
>> * VECTOR as a prefix feels out of place, but VECTOR as a direct type makes more sense… this also leads to a possible future of VECTOR<type>, which is the non-fixed-length version of this type.
>>   What makes VECTOR different from list/array? It has non-null elements and is frozen. I don’t feel that VECTOR really tells users to expect non-null or frozen semantics, as there exist different VECTOR types for those reasons (sparse vs dense)…
>> * DENSE may be confusing for people coming from languages where this just means “sequential layout”, which is what our frozen array/list already are… but since the target user is coming from an ML background, this shouldn’t cause much confusion. DENSE just means FROZEN in Cassandra, with NON NULL elements (SPARSE allows for NULL and isn’t frozen)… So DENSE just acts as syntax sugar for frozen<non null type[dimension]>
>>
>> On May 4, 2023, at 4:13 AM, Brandon Williams <dri...@gmail.com> wrote:
>>
>> 1. VECTOR<FLOAT,n>
>> 2. VECTOR FLOAT[n]
>> 3. FLOAT[N] (Non null by default)
>>
>> Redundant or not, I think having the VECTOR keyword helps signify what the app is generally about and helps get buy-in from ML stakeholders.
>>
>> On Thu, May 4, 2023 at 3:45 AM Benedict <bened...@apache.org> wrote:
>>
>> Hurrah for initial agreement.
>>
>> For syntax, I think one option was just FLOAT[N]. In VECTOR FLOAT[N], VECTOR is redundant - FLOAT[N] is fully descriptive by itself. I don’t think VECTOR should be used to simply imply non-null, as this would be very unintuitive. More logical would be NONNULL, if this is the only condition being applied. Alternatively for arrays we could default to NONNULL and later introduce NULLABLE if we want to permit nulls.
>>
>> If the word vector is to be used it makes more sense to make it look like a list, so VECTOR<FLOAT, N>, as here the word VECTOR is clearly not redundant.
>>
>> So, I vote:
>>
>> 1) (NON NULL) FLOAT[N]
>> 2) FLOAT[N] (Non null by default)
>> 3) VECTOR<FLOAT, N>
>>
>> On 4 May 2023, at 08:52, Mick Semb Wever <m...@apache.org> wrote:
>>
>> Did we agree on a CQL syntax?
>>
>> I don’t believe there has been a poll on CQL syntax… my understanding from reading all the threads is that there are ~4-5 options and none are -1ed, so I believe we are waiting for majority rule on this?
>>
>> Re-reading that thread, IIUC the valid choices remaining are…
>>
>> 1. VECTOR FLOAT[n]
>> 2. FLOAT VECTOR[n]
>> 3. VECTOR<FLOAT,n>
>> 4. VECTOR[n]<FLOAT>
>> 5. ARRAY<FLOAT, n>
>> 6. NON-NULL FROZEN<FLOAT[n]>
>>
>> Yes, I'm putting my preference (1) first ;) because (banging on) if the future of CQL will have FLOAT[n] and FROZEN<FLOAT[n]>, where the VECTOR keyword, for general CQL users, just means "non-null and frozen", these gel best together.
>>
>> Options (5) and (6) are for those that feel we can and should provide this type without introducing the vector keyword.
>>
>> --
>> *Mike Adamson*
>> Engineering
>> +1 650 389 6000 | datastax.com <https://www.datastax.com/>

--
+---------------------------------------------------------------+
| Derek Chen-Becker                                             |
| GPG Key available at https://keybase.io/dchenbecker and       |
| https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
| Fngrprnt: EB8A 6480 F0A3 C8EB C1E7 7F42 AFC5 AFEE 96E4 6ACC   |
+---------------------------------------------------------------+