[GitHub] [cassandra] adelapena commented on a diff in pull request #2673: CASSANDRA-18715 Add support for a vector search index in SAI

via GitHub Fri, 15 Sep 2023 05:42:45 -0700


adelapena commented on code in PR #2673:
URL: https://github.com/apache/cassandra/pull/2673#discussion_r1327183445



##########
src/java/org/apache/cassandra/cql3/restrictions/StatementRestrictions.java:
##########
@@ -276,8 +290,25 @@ else if (relation.isLIKE())
             }
             if (hasQueriableIndex)
                 usesSecondaryIndexing = true;
-            else if (!allowFiltering && requiresAllowFilteringIfNotSpecified())
-                throw invalidRequest(allowFilteringMessage(state));
+            else
+            {
+                if (nonPrimaryKeyRestrictions.hasAnn())
+                {
+                    var vectorColumn = 
nonPrimaryKeyRestrictions.getColumnDefs().stream().filter(c -> 
c.type.isVector()).findFirst();

Review Comment:
   Can't we have multiple vector columns, so the first vector column is not the 
indexed one? Do we have any tests covering this?



##########
src/java/org/apache/cassandra/cql3/restrictions/StatementRestrictions.java:
##########
@@ -276,8 +290,25 @@ else if (relation.isLIKE())
             }
             if (hasQueriableIndex)
                 usesSecondaryIndexing = true;
-            else if (!allowFiltering && requiresAllowFilteringIfNotSpecified())
-                throw invalidRequest(allowFilteringMessage(state));
+            else

Review Comment:
   Can we add braces to the `if` block above, so it matches the braced `else` 
block?



##########
src/java/org/apache/cassandra/db/ReadQuery.java:
##########
@@ -256,4 +256,12 @@ default void maybeValidateIndex()
     default void trackWarnings()
     {
     }
+
+    /**
+     * @return true given read query is a top-k request

Review Comment:
   ```suggestion
        * @return {@code true} if this is a top-k query
   ```



##########
src/java/org/apache/cassandra/db/ColumnFamilyStore.java:
##########
@@ -1456,6 +1457,12 @@ public void apply(PartitionUpdate update, 
CassandraWriteContext context, boolean
         }
         catch (RuntimeException e)
         {
+            if (e instanceof InvalidRequestException)
+            {
+                throw new InvalidRequestException(e.getMessage()

Review Comment:
   The message construction is identical for `InvalidRequestException` and 
`RuntimeException`. We could use:
   ```java
   catch (RuntimeException e)
   {
       String msg = e.getMessage() + " for ks: " + getKeyspaceName() + ", 
table: " + name;
   
       if (e instanceof InvalidRequestException)
           throw new InvalidRequestException(msg, e);
       
       throw new RuntimeException(msg, e);
   }
   ```



##########
src/java/org/apache/cassandra/cql3/restrictions/StatementRestrictions.java:
##########
@@ -476,6 +507,38 @@ public boolean usesSecondaryIndexing()
         return this.usesSecondaryIndexing;
     }
 
+    /**
+     * This is a hack to push ordering down to indexes.
+     * Indexes are selected based on RowFilter only, so we need to turn 
orderings into restrictions
+     * so they end up in the row filter.
+     *
+     * @param orderings orderings from the select statement
+     * @return the {@link RestrictionSet} with the added orderings
+     */
+    private RestrictionSet addOrderingRestrictions(List<Ordering> orderings, 
RestrictionSet restrictionSet)
+    {
+        List<Ordering> annOrderings = orderings.stream().filter(o -> 
o.expression.hasNonClusteredOrdering()).collect(Collectors.toList());
+
+        if (annOrderings.size() > 1)
+            throw new InvalidRequestException("Cannot specify more than one 
ANN ordering");
+        else if (annOrderings.size() == 1)
+        {
+            if (orderings.size() > 1)
+                throw new InvalidRequestException("ANN ordering does not 
support secondary ordering");

Review Comment:
   I would say something like `ANN ordering does not support any other 
ordering`, since ANN could be considered the secondary ordering.



##########
src/java/org/apache/cassandra/cql3/statements/SelectStatement.java:
##########
@@ -1573,6 +1656,39 @@ public int compare(List<ByteBuffer> a, List<ByteBuffer> 
b)
         }
     }
 
+    private static class IndexColumnComparator<T> extends 
ColumnComparator<List<ByteBuffer>>

Review Comment:
   The generic (`<T>`) doesn't seem to be used.



##########
src/java/org/apache/cassandra/db/marshal/VectorType.java:
##########
@@ -365,6 +366,7 @@ public abstract class VectorSerializer extends 
TypeSerializer<List<T>>
 
         public abstract <V> List<V> split(V buffer, ValueAccessor<V> accessor);
         public abstract <V> V serializeRaw(List<V> elements, ValueAccessor<V> 
accessor);
+        public abstract float[] deserializeFloatArray(ByteBuffer input);

Review Comment:
   Does this method duplicate `composeAsFloat`?



##########
src/java/org/apache/cassandra/cql3/restrictions/StatementRestrictions.java:
##########
@@ -276,8 +290,25 @@ else if (relation.isLIKE())
             }
             if (hasQueriableIndex)
                 usesSecondaryIndexing = true;
-            else if (!allowFiltering && requiresAllowFilteringIfNotSpecified())
-                throw invalidRequest(allowFilteringMessage(state));
+            else
+            {
+                if (nonPrimaryKeyRestrictions.hasAnn())
+                {
+                    var vectorColumn = 
nonPrimaryKeyRestrictions.getColumnDefs().stream().filter(c -> 
c.type.isVector()).findFirst();
+                    if (vectorColumn.isPresent())
+                    {
+                        var vc = vectorColumn.get();
+                        var hasIndex = 
indexRegistry.listIndexes().stream().anyMatch(i -> i.dependsOn(vc));

Review Comment:
   Can we add an assertion verifying that `indexRegistry` is not null to 
prevent the warning?



##########
src/java/org/apache/cassandra/cql3/statements/SelectStatement.java:
##########
@@ -822,7 +825,7 @@ private DataLimits getDataLimits(int userLimit,
         // If we do post ordering we need to get all the results sorted before 
we can trim them.
         if (aggregationSpec != AggregationSpecification.AGGREGATE_EVERYTHING)
         {
-            if (!needsPostQueryOrdering())
+            if (!needsToSkipUserLimit())

Review Comment:
   I think it's the only usage of `needsToSkipUserLimit`, so it can probably be 
inlined here.



##########
src/java/org/apache/cassandra/cql3/statements/SelectStatement.java:
##########
@@ -288,7 +291,7 @@ public ResultMessage.Rows execute(QueryState state, 
QueryOptions options, long q
             query.trackWarnings();
         ResultMessage.Rows rows;
 
-        if (aggregationSpec == null && (pageSize <= 0 || 
(query.limits().count() <= pageSize)))
+        if (aggregationSpec == null && (pageSize <= 0 || 
(query.limits().count() <= pageSize) || query.isTopK()))

Review Comment:
   `StorageAttachedIndex#validate` makes sure that we cannot use ANN without a 
`LIMIT`. However, paging is disabled in the CQL layer, so any index 
implementation could provide support for ANN. I think that the existence of a 
`LIMIT` should be checked here in `SelectStatement`.
   
   As for the specific limit, 100 by default, I guess it could also be checked 
on the generic CQL layer because it affects the size of the pages. Should it 
also be checked by the specific index implementation?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [cassandra] adelapena commented on a diff in pull request #2673: CASSANDRA-18715 Add support for a vector search index in SAI

Reply via email to