Re: [DISCUSS] New data type for vector search

Benedict Fri, 28 Apr 2023 09:33:27 -0700

pgvector is a plug-in. If you were proposing a plug-in you could ignore these considerations.

On 28 Apr 2023, at 16:58, Jonathan Ellis <[email protected]> wrote:

I'm proposing a vector data type for ML use cases. It's not the same thing as an array or a list and it's not supposed to be.

While it's true that it would be possible to build a vector type on top of an array type, it's not necessary to do it that way, and given the lack of interest in an array type for its own sake I don't see why we would want to make that a requirement.

It's relevant that pgvector, which among the systems offering vector search is based on the most similar system to Cassandra in terms of its query language, adds a vector data type that only supports floats *even though postgresql already has an array data type* because the semantics are different. Random access doesn't make sense, string and collection and other datatypes don't make sense, typical ordered indexes don't make sense, etc. It's just a different beast from arrays, for a different use case.

On Fri, Apr 28, 2023 at 10:40 AM Benedict <[email protected]> wrote:
But you’re proposing introducing a general purpose type - this isn’t an ML plug-in, it’s modifying the core language in a manner that makes targeting your workload easier. Which is fine, but that means you have to consider its impact on the general language, not just your target use case.

On 28 Apr 2023, at 16:29, Jonathan Ellis <[email protected]> wrote:

That's exactly right.

In particular it makes no sense at all from an ML perspective to have vector types of anything other than numerics. And as I mentioned in the POC thread (but I did not mention here), float is overwhelmingly the most frequently used vector type, to the point that Pinecone (by far the most popular vector search engine) ONLY supports that type.

Lucene and Elastic also add support for vectors of bytes (8-bit ints), which are useful for optimizing models that you have already built with floats, but we have no reasonable path towards supporting indexing and searches against any other vector type.

So in order of what makes sense to me:

1. Add a vector type for just floats; consider adding bytes later if demand materializes. This gives us 99% of the value and limits the scope so we can deliver quickly.

2. Add a vector type for floats or bytes. This gives us another 1% of value in exchange for an extra 20% or so of effort.

3. Add a vector type for all numeric primitives, but you can only index floats and bytes. I think this is confusing to users and a bad idea.

4. Add a vector type that composes with all Cassandra types. I can't see a reason to do this, nobody wants it, and we killed the most similar proposal in the past as wontfix.

On Thu, Apr 27, 2023 at 7:49 PM Josh McKenzie <[email protected]> wrote:
From a machine learning perspective, vectors are a well-known concept: effectively immutable, fixed-length, n-dimensional values that are later used either as part of a model or in conjunction with a model after the fact.

While we could have this be non-frozen and not call it a vector, I'd be inclined to still make the argument for a layer of syntactic sugar on top that met ML users where they were with concepts they understood rather than forcing them through the cognitive lift of figuring out the Cassandra specific contortions to replicate something that's ubiquitous in their space. We did the same "Cassandra-first" approach with our JSON support and that didn't do us any favors in terms of adoption and usage as far as I know.

So is the goal here to provide something specific and idiomatic for the ML community or is the goal to make a primitive that's C*-centric that then another layer can write to? I personally argue for the former; I don't see this specific data type going away any time soon.

On Thu, Apr 27, 2023, at 12:39 PM, David Capwell wrote:
but as you point out it has the problem of allowing nulls.

If nulls are not allowed for the elements, then we need either a) a new type, or b) some way to say elements may not be null… As much as I like b, I am leaning towards a new type for this use case.

So, to flesh out the type requirements I have seen so far:

1) represents a fixed size array of element type
* on write path we will need to validate this
2) element may not be null
* on write path we will need to validate this
3) “frozen” (is this really a requirement for the type or is this just simpler for the ANN work? I feel that this shouldn’t be a requirement)
4) works for all types (my requirement; original proposal is float only, but could logically expand to primitive types)
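
As a rough illustration of requirements 1 and 2, here is a minimal Python sketch of the kind of write-path check being described. The function name `validate_vector` and its parameters are hypothetical, chosen only for this example; this is not Cassandra code and says nothing about how the actual patch would implement validation.

```python
from typing import List, Optional, Sequence


def validate_vector(value: Sequence[Optional[float]], dimension: int) -> List[float]:
    """Hypothetical write-path check for a fixed-size, non-null vector."""
    # Requirement 1: the value must have exactly the declared number of elements.
    if len(value) != dimension:
        raise ValueError(f"expected {dimension} elements, got {len(value)}")
    # Requirement 2: no element may be null.
    for i, element in enumerate(value):
        if element is None:
            raise ValueError(f"element {i} is null")
    # Normalize to floats, matching the float-only original proposal.
    return [float(element) for element in value]
```

Both checks are linear in the declared dimension, so under these assumptions the cost per write is bounded by the type declaration itself.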

Anything else?

The key thing about a vector is that unlike lists or tuples you really don't care about individual elements, you care about doing vector and matrix multiplications with the thing as a unit.

That may be true for this use case, but “should” this be true for the type itself? I feel like no… if a user wants the Nth element of a vector, why would we block them? I am not saying the first patch, or even 5.0, adds support for index access; I am just pushing back to say that the type should not block this.
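
To make the two views concrete, here is a small Python sketch (an illustration only, with hypothetical helper names `dot` and `cosine_similarity`, and cosine similarity assumed as a representative whole-vector operation). The ML workload consumes the vector as a unit, while element access remains an ordinary, separate operation that the representation does not have to forbid.

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two vectors with the same dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Whole-vector operation: no individual element matters on its own."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot(a, b) / (norm_a * norm_b)


embedding = [3.0, 4.0]
similarity = cosine_similarity(embedding, [4.0, 3.0])  # vector used as a unit
second_element = embedding[1]                          # index access still possible
```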

(Maybe this is making the case for VECTOR FLOAT[N] rather than FLOAT VECTOR[N].)

Now that nulls are not allowed, I have mixed feelings about FLOAT[N]. I prefer this syntax, but that limitation may not be desired for all use cases… we could always add LIST<TYPE, N> and ARRAY<TYPE, N> later to address that case.

In terms of syntax I have seen, here is my ordered preference:

1) TYPE[size] - have mixed feelings due to non-null, but still prefer it
2) QUALIFIER TYPE[size] - QUALIFIER is just a term we use to denote this semantic… Could even be NON NULL TYPE[size]

On Apr 27, 2023, at 9:00 AM, Benedict <[email protected]> wrote:

That’s a bounded ring buffer, not a fixed length array.

This definitely isn’t a tuple because the types are all the same, which is pretty crucial for matrix operations. Matrix libraries generally work on arrays of known dimensionality, or sparse representations.

Whether we draw any semantic link between the frozen list and whatever we do here, it is fundamentally a frozen list with a restriction on its size. What we’re defining here are “statically” sized arrays, whereas a frozen list is essentially a dynamically sized array.

I do not think vector is a good name because vector is used in some other popular languages to mean a (dynamic) list, which is confusing when we also have a list concept.

I’m fine with just using the FLOAT[N] syntax, and drawing no direct link with list. Though it is a bit strange that this particular type declaration looks so different to other collection types.

On 27 Apr 2023, at 16:48, Jeff Jirsa <[email protected]> wrote:

On Thu, Apr 27, 2023 at 7:39 AM Jonathan Ellis <[email protected]> wrote:
It's been a while, so I may be missing something, but do we already have fixed-size lists? If not, I don't see why we'd try to make this fit into a List-shaped problem.

We do not. The proposal got closed as wontfix: https://issues.apache.org/jira/browse/CASSANDRA-9110

--
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced
