>
> So is the goal here to provide something specific and idiomatic for the ML
> community or is the goal to make a primitive that's C*-centric that then
> another layer can write to? I personally argue for the former; I don't see
> this specific data type going away any time soon.


+1 on this concept. We could invite an entirely new class of users into
Cassandra by using familiar syntax. I was surprised that DENSE got nuked so
quickly since it is used in the ML world. [1][2][3]

Patrick

1.
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.linalg.DenseVector.html
2. https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense
3. https://www.pinecone.io/learn/dense-vector-embeddings-nlp/

On Thu, Apr 27, 2023 at 5:49 PM Josh McKenzie <jmcken...@apache.org> wrote:

> From a machine learning perspective, vectors are a well-known concept:
> effectively immutable, fixed-length n-dimensional values that are later
> used either as part of a model or in conjunction with a model after the
> fact.
>
> While we could have this be non-frozen and not call it a vector, I'd be
> inclined to still make the argument for a layer of syntactic sugar on top
> that met ML users where they were with concepts they understood rather than
> forcing them through the cognitive lift of figuring out the Cassandra
> specific contortions to replicate something that's ubiquitous in their
> space. We did the same "Cassandra-first" approach with our JSON support and
> that didn't do us any favors in terms of adoption and usage as far as I
> know.
>
> So is the goal here to provide something specific and idiomatic for the ML
> community or is the goal to make a primitive that's C*-centric that then
> another layer can write to? I personally argue for the former; I don't see
> this specific data type going away any time soon.
>
> On Thu, Apr 27, 2023, at 12:39 PM, David Capwell wrote:
>
> but as you point out it has the problem of allowing nulls.
>
>
> If nulls are not allowed for the elements, then either we need a) a new
> type, or b) add some way to say elements may not be null…. As much as I do
> like b, I am leaning towards new type for this use case.
>
> So, to flesh out the type requirements I have seen so far
>
> 1) represents a fixed size array of element type
> * on write path we will need to validate this
> 2) element may not be null
> * on write path we will need to validate this
> 3) “frozen” (is this really a requirement for the type or is this
> just simpler for the ANN work?  I feel that this shouldn’t be a requirement)
> 4) works for all types (my requirement; original proposal is float only,
> but could logically expand to primitive types)
>
> Anything else?
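
The first three requirements above can be sketched as a write-path check. This is only an illustration, not Cassandra code: the function name `validate_fixed_vector` and its `dimension` parameter are hypothetical, and it assumes the declared size is known when the value is written.

```python
from typing import Any, Sequence, Tuple


def validate_fixed_vector(values: Sequence[Any], dimension: int) -> Tuple[Any, ...]:
    """Hypothetical write-path validation for a fixed-size, non-null vector.

    Requirement 1: the value must have exactly `dimension` elements.
    Requirement 2: no element may be null (None here).
    Requirement 3 ("frozen") is approximated by returning an immutable tuple.
    The element type is left open, in the spirit of requirement 4.
    """
    if len(values) != dimension:
        raise ValueError(f"expected {dimension} elements, got {len(values)}")
    for index, element in enumerate(values):
        if element is None:
            raise ValueError(f"element {index} is null")
    return tuple(values)


# Usage: a well-formed value passes; a short or null-containing one is rejected.
embedding = validate_fixed_vector([0.1, 0.2, 0.3], dimension=3)
```

Rejecting the whole write when either check fails keeps the stored value usable as a unit, which is the property the thread cares about.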
>
> The key thing about a vector is that unlike lists or tuples you really
> don't care about individual elements, you care about doing vector and
> matrix multiplications with the thing as a unit.
>
>
> That may be true for this use case, but “should” this be true for the type
> itself?  I feel like no… if a user wants the Nth element of a vector why
> would we block them?  I am not saying the first patch, or even 5.0 adds
> support for index access, I am just trying to push back saying that the
> type should not block this.
>
> (Maybe this is making the case for VECTOR FLOAT[N] rather than FLOAT
> VECTOR[N].)
>
>
> Now that nulls are not allowed, I have mixed feelings about FLOAT[N], I
> prefer this syntax but that limitation may not be desired for all use
> cases… we could always add LIST<TYPE, N> and ARRAY<TYPE, N> later
> to address that case.
>
> In terms of syntax I have seen, here is my ordered preference:
>
> 1) TYPE[size] - have mixed feelings due to non-null, but still prefer it
> 2) QUALIFIER TYPE[size] - QUALIFIER is just a Term we use to denote this
> semantic…. Could even be NON NULL TYPE[size]
>
> On Apr 27, 2023, at 9:00 AM, Benedict <bened...@apache.org> wrote:
>
>
> That’s a bounded ring buffer, not a fixed length array.
>
> This definitely isn’t a tuple because the types are all the same, which is
> pretty crucial for matrix operations. Matrix libraries generally work on
> arrays of known dimensionality, or sparse representations.
>
> Whether we draw any semantic link between the frozen list and whatever we
> do here, it is fundamentally a frozen list with a restriction on its size.
> What we’re defining here are “statically” sized arrays, whereas a frozen
> list is essentially a dynamically sized array.
>
> I do not think vector is a good name because vector is used in some other
> popular languages to mean a (dynamic) list, which is confusing when we also
> have a list concept.
>
> I’m fine with just using the FLOAT[N] syntax, and drawing no direct link
> with list. Though it is a bit strange that this particular type declaration
> looks so different to other collection types.
>
> On 27 Apr 2023, at 16:48, Jeff Jirsa <jji...@gmail.com> wrote:
>
>
>
>
> On Thu, Apr 27, 2023 at 7:39 AM Jonathan Ellis <jbel...@gmail.com> wrote:
>
> It's been a while, so I may be missing something, but do we already have
> fixed-size lists?  If not, I don't see why we'd try to make this fit into a
> List-shaped problem.
>
>
> We do not. The proposal got closed as won't-fix:
> https://issues.apache.org/jira/browse/CASSANDRA-9110
>
>
>
>
