Not sure if this is what you are saying, Josh, but I believe this needs to
be its own CEP. It's a change in CQL syntax and changes how clusters
operate. The change needs to be documented and voted on. Jonathan, you know
how to find me if you want me to help write it. :)

As a side comment to all of this: at last ApacheCon in New Orleans,
Jordan West, Alex Petrov, and I were sitting in the hall track, discussing
just what we have in Cassandra. There is no other system like Cassandra;
for scale and distributed data, it stands alone. What do we do with a
robust baseline like this?

This! This is a great example!

Patrick


On Tue, Apr 25, 2023 at 5:03 PM Josh McKenzie <jmcken...@apache.org> wrote:

> To be fair, Dinesh kind of primed that:
>
> Do you intend to make this part of CEP-7 or as an incremental update to
> SAI once it is committed?
>
> ;)
>
> I think this body of work more than stands on its own. Great work
> Jonathan, Mike, and Zhao; having native support for more ML-oriented
> workloads in C* would be a big win for a bunch of our users and plays into
> our architectural strengths in a lot of ways too.
>
> On Tue, Apr 25, 2023, at 7:35 PM, Henrik Ingo wrote:
>
> Jonathan, what a great proposal and code. An enjoyable read, and at least
> for me an educational one! (Which is notable, as you're on my turf: I'm a
> Data Science major.)
>
> Sorry for splitting hairs, but CEP-7 (as a spec and wiki page) is approved
> and voted on, and I assume there's no proposal to change that. That said,
> work of course continues beyond CEP-7, and this is not the only SAI feature
> that builds on the CEP-7 foundation.
>
> I just wanted to clarify so there's no confusion later.
>
> henrik
>
> On Sat, Apr 22, 2023 at 10:41 PM Jonathan Ellis <jbel...@gmail.com> wrote:
>
> My guess is that I will be able to get this ready to upstream before the
> rest of CEP-7 goes in, so it would make sense to me to roll it into that.
>
> On Fri, Apr 21, 2023 at 5:34 PM Dinesh Joshi <djo...@apache.org> wrote:
>
> Interesting proposal Jonathan. Will grok it over the weekend and play
> around with the branch.
>
> Do you intend to make this part of CEP-7 or as an incremental update to
> SAI once it is committed?
>
> On Apr 21, 2023, at 2:19 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
>
> Happy Friday, everyone!
>
> Rich text formatting ahead; I've attached a PDF for those who prefer that.
>
>
> I propose adding approximate nearest neighbor (ANN) vector search
> capability to Apache Cassandra via storage-attached indexes (SAI). This is
> a medium-sized effort that will significantly enhance Cassandra’s
> functionality, particularly for AI use cases. This addition will not only
> provide a new and important feature for existing Cassandra users, but also
> attract new users to the community from the AI space, further expanding
> Cassandra’s reach and relevance.
> Introduction
> Vector search is a powerful document search technique that enables
> developers to quickly find relevant content within an extensive collection
> of documents.  It is useful as a standalone technique, but it is
> particularly hot right now because it significantly enhances the
> performance of LLMs.
>
> Vector search uses ML models to match the semantics of a question rather
> than just the words it contains, avoiding the classic false positives and
> false negatives associated with term-based search.  Alessandro Benedetti
> gives some good examples in his *excellent talk*
> <https://www.youtube.com/watch?v=z-i8mOHAhlU>.
>
> You can search across any set of vectors, which are just ordered sets of
> numbers.  In the context of natural language queries and document search,
> we are specifically concerned with a type of vector called an *embedding*.
>
> An embedding is a high-dimensional vector that captures the underlying
> semantic relationships and contextual information of words or phrases.
> Embeddings are generated by ML models trained for this purpose; OpenAI
> provides an API to do this, but open-source and self-hostable models like
> BERT are also popular. Creating smaller and more accurate embeddings is an
> active area of research in ML.
>
> Large language models (LLMs) can be described as a mile wide and an inch
> deep. They are not experts on any narrow domain (although they will
> hallucinate that they are, sometimes convincingly).  You can remedy this by
> giving the LLM additional context for your query, but the context window is
> small (4k tokens for GPT-3.5, up to 32k for GPT-4), so you want to be very
> selective about giving the LLM the most relevant possible information.
>
> Vector search is red-hot now because it allows us to easily answer the
> question “what are the most relevant documents to provide as context” by
> performing a similarity search between the embedding vector of the query
> and those of your document universe.  Doing exact search is prohibitively
> expensive, since you necessarily have to compare with each and every
> document; this is intractable when you have billions or trillions of docs.
> However, there are well-understood algorithms for turning this into a
> logarithmic problem if you are willing to accept *approximately *the most
> similar documents.  This is the “approximate nearest neighbor” problem.
> (You will see these referred to as kNN – k nearest neighbors – or ANN.)
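>
> To make that cost concrete, here is a minimal Python sketch (mine, not from
> the branch) of exact brute-force kNN over cosine similarity; the O(N) scan
> per query below is exactly what ANN indexes like HNSW replace with a
> roughly logarithmic search:

```python
import math

def cosine_similarity(a, b):
    # Similarity of two embedding vectors: 1.0 = same direction, 0.0 = orthogonal.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def exact_knn(query, docs, k):
    # Brute force: score every document, keep the top k.
    # This per-query O(N) scan is the cost ANN algorithms avoid.
    return sorted(docs, key=lambda d: cosine_similarity(query, d), reverse=True)[:k]

# Toy 2-dimensional "embeddings"; real ones have hundreds of dimensions.
docs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
print(exact_knn([1.0, 0.0], docs, 2))
```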
>
> *Pinecone DB has a good example of what this looks like in Python code*
> <https://docs.pinecone.io/docs/gen-qa-openai>.
>
> Vector search is the foundation underlying effectively all of the AI
> applications that are launching now.  This is particularly relevant to
> Apache Cassandra users, who tend to manage the types of large datasets that
> benefit the most from fast similarity search. Adding vector search to
> Cassandra’s unique strengths of scale, reliability, and low latency will
> further enhance its appeal and effectiveness for these users while also
> making it more attractive to newcomers looking to harness AI’s potential.
> The faster we deliver vector search, the more valuable it will be for this
> expanding user base.
> Requirements
>
>    1. Perform vector search as outlined in the Pinecone example above
>       1. Support Float32 embeddings in the form of a new DENSE FLOAT32 CQL
>          type
>          1. This is also useful for “classic” ML applications that derive
>             and serve their own feature vectors
>       2. Add ANN (approximate nearest neighbor) search
>    2. Work with normal Cassandra data flow
>       1. Inserting one row at a time is fine; cannot require batch ingest
>       2. Updating and deleting rows is also fine
>       3. Must compose with other SAI predicates as well as partition keys
>
> Not requirements
>
>    1. Other datatypes besides Float32
>       1. *Pinecone supports only Float32 and it’s hugely ahead in mindshare,
>          so let’s make things easy on ourselves and follow their precedent.*
>    2. I don’t want to scope creep beyond ANN. In particular, I don’t want
>       to wait for ORDER BY to get exact search in as well.
>
> How we can do this
> There is exactly one production-quality, actively maintained,
> state-of-the-art ANN library for Java: Lucene’s HNSW.  Elastic has been GA
> with ANN search backed by this library for just over a year; Solr added it
> not long after.  As usual with Lucene code, it is not super concerned with
> making it usable by third parties like Cassandra, but HNSW offers public
> APIs that are low level enough that we don’t have to bring in Lucene
> baggage that we don’t want.
>
> The SAI framework is flexible enough to allow plugging in an HNSW vector
> index while enjoying the baseline benefits of SAI and playing nice with the
> other SAI index types.
> What I have today
> I have a pre-alpha branch that demonstrates
>
>    1. A new data type, DENSE FLOAT32 (CQL) / DenseFloat32Type (internally)
>    2. A new SAI index, VectorMemtableIndex
>    3. A new CQL operator, ANN
>
>
> So this works:
> cqlsh> create KEYSPACE test WITH replication = {'class': 'SimpleStrategy',
> 'replication_factor': 1};
> cqlsh> use test;
>
> cqlsh:test> create table foo(i int primary key, j dense float32);
> cqlsh:test> create custom index ann_index on foo(j) using
> 'StorageAttachedIndex';
>
> cqlsh:test> insert into foo (i, j) values (1, [8, 2.3, 58]);
> cqlsh:test> insert into foo (i, j) values (2, [1.2, 3.4, 5.6]);
> cqlsh:test> insert into foo (i, j) values (5, [23, 18, 3.9]);
> cqlsh:test> select * from foo where j ann [3.4, 7.8, 9.1] limit 1;
>
>  i | j
> ---+---------------------------------------------------------
>  5 | b'\x00\x00\x00\x03A\xb8\x00\x00A\x90\x00\x00@y\x99\x9a'
>
> For anything else you can imagine asking “does it do X?”, the answer is no,
> it doesn’t.  It doesn’t save to disk, it doesn’t compose with other
> predicates, and it doesn’t have driver support.  That last point is why the
> answer comes back as raw hex bytes.  (The CQL parser handling the inserts
> is server-side and doesn’t need driver support, which is why we can do
> inserts and queries at all.)
>
> (If you’re tempted to try this out, remember that “it doesn’t save to
> disk” means that if you do inserts first and then create the index, instead
> of the other way around, it will break, because we flush existing data as
> part of adding a new index to a table.  So it’s quite fragile at the
> moment!)
>
> The branch is public at
> *https://github.com/datastax/cassandra/tree/vsearch*
> <https://github.com/jbellis/cassandra/tree/vsearch>
>
> I based it on the DataStax converged cassandra repo because that’s the
> only place to get fully working SAI for now.  We plan to upstream vector
> search with the rest of SAI as soon as it’s ready.
> What’s next
> I think we can get to alpha pretty quickly:
>
>    1. Add support for flushing, and serving queries from disk without
>    reading the whole graph into memory.  (Lucene does this already; we will
>    probably have to rewrite a bunch of the code for use in Cassandra, but
>    the design is vetted and works.)
>
>    2. Make sure that scatter/gather works across nodes.  This should Just
>    Work, but I’ve already needed to tweak a couple of things that I thought
>    would just work, so I’m calling it out.
>
> And then to be feature complete we would want to add:
>
>    1. CQL support for specifying the dimensionality of a column up front,
>       e.g. DENSE FLOAT32(1500).
>       1. *Pinecone takes the first vector it receives as defining the
>          correct dimensionality, and throws an error if you give it one
>          that doesn’t match, but this doesn’t work in a distributed system
>          where the second vector might go to a different node than the
>          first.  Elastic requires specifying it up front like I propose
>          here.*
>    2. Driver support.
>    3. Support for updating the vector in a row that hasn’t been flushed
>       yet, either by updating HNSW to support deletes, or by adding some
>       kind of invalidation marker to the overwritten vector.
>    4. More performant inserts.  Currently I have a big synchronized lock
>       around the HNSW graph, so we either need to shard it, like we do for
>       the other SAI indexes, or add fine-grained locking like
>       ConcurrentSkipListMap to make HnswGraphBuilder concurrent-ish.  I
>       prefer the second option, since it allows us to avoid building the
>       graph once in memory and then a second time on flush, but it might be
>       more work than it appears.
>    5. Add index configurability options: similarity function and HNSW
>       parameters M and ef.
>    6. Expose the similarity functions to CQL, so you can SELECT x,
>       cosine_similarity(x, query_vector) FROM …
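>
> To illustrate the sharding option from point 4, here is a toy Python sketch
> (all names are hypothetical; the real code would wrap Lucene's
> HnswGraphBuilder rather than a list): each shard owns its own lock, so
> concurrent writers only contend when they land on the same shard, and a
> query fans out across all shards and merges results, much like SAI already
> does across index segments.

```python
import threading

class ShardedVectorIndex:
    """Toy model of the sharding approach: one independently locked
    sub-index per shard, so writers contend only within a shard."""

    def __init__(self, num_shards=4):
        self.shards = [[] for _ in range(num_shards)]
        self.locks = [threading.Lock() for _ in range(num_shards)]

    def insert(self, key, vector):
        i = hash(key) % len(self.shards)
        with self.locks[i]:  # only this shard is locked, not the whole index
            self.shards[i].append((key, vector))

    def search(self, score, k):
        # Fan out over every shard and merge, like SAI segment search.
        candidates = [kv for shard in self.shards for kv in shard]
        return sorted(candidates, key=lambda kv: score(kv[1]), reverse=True)[:k]

idx = ShardedVectorIndex(num_shards=2)
idx.insert("doc-a", [0.2, 0.9])
idx.insert("doc-b", [0.8, 0.1])
print(idx.search(lambda v: v[0], 1))  # top hit by first-component score
```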
>
> Special thanks to
> Mike Adamson and Zhao Yang made substantial contributions to the branch
> you see here and provided indispensable help understanding SAI.
>
> <Vector search for Cassandra.pdf>
>
>
>
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>
>
>
> --
>
>
>
> *Henrik Ingo*
>
> *c*. +358 40 569 7354
>
> *w*. *www.datastax.com <http://www.datastax.com>*
>
>
>
>
