Re: [PR] CASSANDRA-20189 Avoid possible consistency violations for SAI intersection queries over repaired index matches and multiple non-indexed column matches [cassandra]

via GitHub Tue, 28 Jan 2025 13:36:14 -0800


dcapwell commented on code in PR #3779:
URL: https://github.com/apache/cassandra/pull/3779#discussion_r1932860508



##########
src/java/org/apache/cassandra/index/sai/plan/Operation.java:
##########
@@ -63,11 +66,58 @@ public boolean apply(boolean a, boolean b)
         }
     }
 
+    public static class Expressions
+    {
+        final ListMultimap<ColumnMetadata, Expression> expressions;
+        final Set<ColumnMetadata> unindexedColumns;
+
+        Expressions(ListMultimap<ColumnMetadata, Expression> expressions, 
Set<ColumnMetadata> unindexedColumns)
+        {
+            this.expressions = expressions;
+            this.unindexedColumns = unindexedColumns;
+        }
+
+        Set<ColumnMetadata> columns()
+        {
+            return expressions.keySet();
+        }
+
+        Collection<Expression> all()
+        {
+            return expressions.values();
+        }
+
+        List<Expression> expressionsFor(ColumnMetadata column)
+        {
+            return expressions.get(column);
+        }
+
+        boolean isEmpty()
+        {
+            return expressions.isEmpty();
+        }
+
+        int size()
+        {
+            return expressions.size();
+        }
+
+        boolean isUnindexed(ColumnMetadata column)
+        {
+            return unindexedColumns.contains(column);
+        }
+
+        boolean hasMultipleUnindexedColumns()
+        {
+            return unindexedColumns.size() > 1;
+        }
+    }
+
     @VisibleForTesting
-    protected static ListMultimap<ColumnMetadata, Expression> 
buildIndexExpressions(QueryController queryController,
-                                                                               
     List<RowFilter.Expression> expressions)
+    protected static Expressions buildIndexExpressions(QueryController 
queryController, List<RowFilter.Expression> expressions)
     {
         ListMultimap<ColumnMetadata, Expression> analyzed = 
ArrayListMultimap.create();
+        Set<ColumnMetadata> unindexedColumns = new HashSet<>(3);

Review Comment:
   if we assume this isn't the norm, we could avoid allocating this until its 
actually needed?



##########
test/distributed/org/apache/cassandra/fuzz/sai/SingleNodeSAITestBase.java:
##########
@@ -156,20 +187,28 @@ private void basicSaiTest(EntropySource rng, SchemaSpec 
schema)
                                                                                
             .pageSizeSelector(pageSizeSelector(rng))
                                                                                
             .consistencyLevel(consistencyLevelSelector())
                                                                                
             .doubleWriting(schema, hb, cluster, "debug_table"));
-        List<Integer> partitions = new ArrayList<>();
-        for (int j = 0; j < 5; j++)
-        {
-            int picked = globalPkGen.generate(rng);
-            if (usedPartitions.contains(picked))
-                continue;
-            partitions.add(picked);
-        }
+        Set<Integer> partitions = new HashSet<>();
+        while (partitions.size() < NUM_VISITED_PARTITIONS)

Review Comment:
   what is the reason for going from `5` to `16`?  What if we used the `rng` to 
pick the number of partitions?



##########
test/distributed/org/apache/cassandra/fuzz/sai/SingleNodeSAITestBase.java:
##########
@@ -136,17 +164,20 @@ private void basicSaiTest(EntropySource rng, SchemaSpec 
schema)
         cluster.forEach(i -> i.nodetool("disableautocompaction"));
 
         cluster.schemaChange(schema.compile());
-        cluster.schemaChange(schema.compile().replace(schema.keyspace + "." + 
schema.table,
-                                                      schema.keyspace + 
".debug_table"));
-        Streams.concat(schema.clusteringKeys.stream(),
-                       schema.regularColumns.stream(),
-                       schema.staticColumns.stream())
+        cluster.schemaChange(schema.compile().replace(schema.keyspace + '.' + 
schema.table, schema.keyspace + ".debug_table"));
+
+        Streams.concat(schema.clusteringKeys.stream(), 
schema.regularColumns.stream(), schema.staticColumns.stream())
                .forEach(column -> {
+                   if (createIndex.get())

Review Comment:
   feel free to ignore, but this might be better as
   
   ```
   if (createIndex.test(rng, column))
   ```
   
   you can still do the `(i1, i2) -> true` that you do in most of the tests, 
but does give you the flexibility for more randomized coverage



##########
test/distributed/org/apache/cassandra/fuzz/sai/SingleNodeSAITestBase.java:
##########
@@ -104,28 +111,49 @@ public static void afterClass()
     @Before
     public void beforeEach()
     {
-        cluster.schemaChange("DROP KEYSPACE IF EXISTS harry");
-        cluster.schemaChange("CREATE KEYSPACE harry WITH replication = 
{'class': 'SimpleStrategy', 'replication_factor': 1};");
+        cluster.schemaChange("DROP KEYSPACE IF EXISTS " + KEYSPACE);
+        cluster.schemaChange("CREATE KEYSPACE " + KEYSPACE + " WITH 
replication = {'class': 'SimpleStrategy', 'replication_factor': 1};");

Review Comment:
   feel free to ignore for this PR, but wouldn't it be easier to maintain this 
logic with regard to multi node if `rf()` was a function? 



##########
test/distributed/org/apache/cassandra/fuzz/sai/SingleNodeSAITestBase.java:
##########
@@ -193,33 +232,50 @@ private void basicSaiTest(EntropySource rng, SchemaSpec 
schema)
             if (rng.nextFloat() > 0.9995f)
                 history.deletePartition(partitionIndex);
 
-            if (i % REPAIR_SKIP == 0)
-                history.custom(() -> repair(schema), "Repair");
-            else if (i % FLUSH_SKIP == 0)
+            if (i % FLUSH_SKIP == 0)
                 history.custom(() -> flush(schema), "Flush");
             else if (i % COMPACTION_SKIP == 0)
                 history.custom(() -> compact(schema), "Compact");
+            else if (i % repairSkip == 0)
+                history.custom(() -> repair(schema), "Repair");
 
-            if (i > 0 && i % 1000 == 0)
+            if (i > 0 && i % VALIDATION_SKIP == 0)
             {
-                for (int j = 0; j < 5; j++)
+                for (int j = 0; j < QUERIES_PER_VALIDATION; j++)

Review Comment:
   why go from `5` to `8`?



##########
src/java/org/apache/cassandra/index/sai/utils/IndexTermType.java:
##########
@@ -319,6 +319,11 @@ public boolean dependsOn(ColumnMetadata columnMetadata)
         return this.columnMetadata.compareTo(columnMetadata) == 0;
     }
 
+    public static boolean isEqOnlyType(AbstractType<?> type)

Review Comment:
   can we update the ast package to use this?  Can drop 
`org.apache.cassandra.cql3.ast.CreateIndexDDL#SAI_EQ_ONLY`  and update 
`org.apache.cassandra.cql3.ast.CreateIndexDDL.Indexer#supportedQueries` to call 
this method directly?
   
   ```
   $ git diff test/unit/org/apache/cassandra/cql3/ast/CreateIndexDDL.java
   diff --git a/test/unit/org/apache/cassandra/cql3/ast/CreateIndexDDL.java 
b/test/unit/org/apache/cassandra/cql3/ast/CreateIndexDDL.java
   index 252dd99bc9..666796647b 100644
   --- a/test/unit/org/apache/cassandra/cql3/ast/CreateIndexDDL.java
   +++ b/test/unit/org/apache/cassandra/cql3/ast/CreateIndexDDL.java
   @@ -36,6 +36,7 @@ import org.apache.cassandra.db.marshal.CompositeType;
    import org.apache.cassandra.db.marshal.UTF8Type;
    import org.apache.cassandra.db.marshal.UUIDType;
    import org.apache.cassandra.index.sai.StorageAttachedIndex;
   +import org.apache.cassandra.index.sai.utils.IndexTermType;
    import org.apache.cassandra.schema.ColumnMetadata;
    import org.apache.cassandra.schema.TableMetadata;
   
   @@ -150,7 +151,7 @@ public class CreateIndexDDL implements Element
            public EnumSet<QueryType> supportedQueries(AbstractType<?> type)
            {
                type = type.unwrap();
   -            if (SAI_EQ_ONLY.contains(type))
   +            if (IndexTermType.isEqOnlyType(type))
                    return EnumSet.of(QueryType.Eq);
                return EnumSet.allOf(QueryType.class);
            }
   ```
   



##########
test/distributed/org/apache/cassandra/fuzz/sai/SingleNodeSAITestBase.java:
##########
@@ -193,33 +232,50 @@ private void basicSaiTest(EntropySource rng, SchemaSpec 
schema)
             if (rng.nextFloat() > 0.9995f)
                 history.deletePartition(partitionIndex);
 
-            if (i % REPAIR_SKIP == 0)
-                history.custom(() -> repair(schema), "Repair");
-            else if (i % FLUSH_SKIP == 0)
+            if (i % FLUSH_SKIP == 0)
                 history.custom(() -> flush(schema), "Flush");
             else if (i % COMPACTION_SKIP == 0)
                 history.custom(() -> compact(schema), "Compact");
+            else if (i % repairSkip == 0)
+                history.custom(() -> repair(schema), "Repair");
 
-            if (i > 0 && i % 1000 == 0)
+            if (i > 0 && i % VALIDATION_SKIP == 0)

Review Comment:
   what motivated the change from `1k` to `739`?



##########
test/distributed/org/apache/cassandra/fuzz/sai/SingleNodeSAITestBase.java:
##########
@@ -193,33 +232,50 @@ private void basicSaiTest(EntropySource rng, SchemaSpec 
schema)
             if (rng.nextFloat() > 0.9995f)
                 history.deletePartition(partitionIndex);
 
-            if (i % REPAIR_SKIP == 0)
-                history.custom(() -> repair(schema), "Repair");
-            else if (i % FLUSH_SKIP == 0)
+            if (i % FLUSH_SKIP == 0)
                 history.custom(() -> flush(schema), "Flush");
             else if (i % COMPACTION_SKIP == 0)
                 history.custom(() -> compact(schema), "Compact");
+            else if (i % repairSkip == 0)
+                history.custom(() -> repair(schema), "Repair");
 
-            if (i > 0 && i % 1000 == 0)
+            if (i > 0 && i % VALIDATION_SKIP == 0)
             {
-                for (int j = 0; j < 5; j++)
+                for (int j = 0; j < QUERIES_PER_VALIDATION; j++)
                 {
-                    List<IdxRelation> regularRelations = 
HistoryBuilderHelper.generateValueRelations(rng, schema.regularColumns.size(),
-                                                                               
                      column -> 
Math.min(schema.valueGenerators.regularPopulation(column), UNIQUE_CELL_VALUES));
-                    List<IdxRelation> staticRelations = 
HistoryBuilderHelper.generateValueRelations(rng, schema.staticColumns.size(),
-                                                                               
                     column -> 
Math.min(schema.valueGenerators.staticPopulation(column), UNIQUE_CELL_VALUES));
-                    history.select(pkGen.generate(rng),
-                                   
HistoryBuilderHelper.generateClusteringRelations(rng, 
schema.clusteringKeys.size(), ckGen).toArray(new IdxRelation[0]),
-                                   regularRelations.toArray(new 
IdxRelation[regularRelations.size()]),
-                                   staticRelations.toArray(new 
IdxRelation[staticRelations.size()]));
+                    List<IdxRelation> regularRelations = 
+                            HistoryBuilderHelper.generateValueRelations(rng,
+                                                                        
schema.regularColumns.size(),
+                                                                        column 
-> Math.min(schema.valueGenerators.regularPopulation(column), 
UNIQUE_CELL_VALUES),
+                                                                        
eqOnlyRegularColumns);

Review Comment:
   @ifesdjeen given the current pattern, wouldn't this be better as a 
`Predicate` rather than a set?
   
   We have `Function<Integer, Integer> population`, so maybe adding
   
   ```
   IntPredicate eqOnly
   ```



##########
test/harry/main/org/apache/cassandra/harry/dsl/HistoryBuilderHelper.java:
##########
@@ -113,19 +118,24 @@ public static void deleteRandomColumns(SchemaSpec schema, 
int partitionIdx, int
      * @param population - expected population / number of possible values for 
a given column
      * @return a list of relations
      */
-    public static List<SingleOperationBuilder.IdxRelation> 
generateValueRelations(EntropySource rng, int numColumns, Function<Integer, 
Integer> population)
+    public static List<SingleOperationBuilder.IdxRelation> 
generateValueRelations(EntropySource rng, int numColumns, Function<Integer, 
Integer> population, Set<Integer> eqOnlyColumns)

Review Comment:
   left this comment at the usage, but `IntPredicate` might be better for this 
api?  `eqOnlyColumns.contains(column)` could become `eqOnly.test(column)`



##########
test/distributed/org/apache/cassandra/fuzz/sai/SingleNodeSAITestBase.java:
##########
@@ -156,20 +187,28 @@ private void basicSaiTest(EntropySource rng, SchemaSpec 
schema)
                                                                                
             .pageSizeSelector(pageSizeSelector(rng))
                                                                                
             .consistencyLevel(consistencyLevelSelector())
                                                                                
             .doubleWriting(schema, hb, cluster, "debug_table"));
-        List<Integer> partitions = new ArrayList<>();
-        for (int j = 0; j < 5; j++)
-        {
-            int picked = globalPkGen.generate(rng);
-            if (usedPartitions.contains(picked))
-                continue;
-            partitions.add(picked);
-        }
+        Set<Integer> partitions = new HashSet<>();
+        while (partitions.size() < NUM_VISITED_PARTITIONS)

Review Comment:
   for me: need to confirm the new tests don't have a possibility of creating a 
schema where the number of possible unique partitions is `< 
NUM_VISITED_PARTITIONS`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] CASSANDRA-20189 Avoid possible consistency violations for SAI intersection queries over repaired index matches and multiple non-indexed column matches [cassandra]

Reply via email to