Re: [PR] Post 19437 live migration for reads and fixes [cassandra]

via GitHub Tue, 14 Jan 2025 15:15:45 -0800


dcapwell commented on code in PR #3777:
URL: https://github.com/apache/cassandra/pull/3777#discussion_r1915695731



##########
src/java/org/apache/cassandra/service/accord/AccordTopology.java:
##########
@@ -280,7 +294,9 @@ public static Topology createAccordTopology(Epoch epoch, 
DistributedSchema schem
             List<TableMetadata> tables = 
keyspace.tables.stream().filter(TableMetadata::requiresAccordSupport).collect(Collectors.toList());
             if (tables.isEmpty())
                 continue;
+            System.out.println("In epoch " + epoch + " Accord tables are " + 
tables);

Review Comment:
   please remove



##########
src/java/org/apache/cassandra/service/accord/txn/AbstractKeySorted.java:
##########
@@ -51,17 +58,63 @@ public AbstractKeySorted(List<T> items)
     {
         T[] arr = newArray(items.size());
         items.toArray(arr);
-        Arrays.sort(arr, this::compare);
         this.items = arr;
+        if (items.size() == 0)
+        {
+            this.itemKeys = Keys.of();
+            return;
+        }
+        Domain domain = getKeys(arr[0]).domain();
+        switch (domain)
+        {
+            case Key:
+                Arrays.sort(arr, this::compareKey);
+                break;
+            case Range:
+                Arrays.sort(arr, this::compareRange);
+                break;
+            default:
+                throw new IllegalStateException("Unhandled domain " + domain);
+        }
         this.itemKeys = extractItemKeys();
     }
 
-    private Keys extractItemKeys()
+    private Seekables extractItemKeys()
     {
-        PartitionKey[] keys = new PartitionKey[items.length];
-        for (int i = 0 ; i < keys.length ; ++i)
-            keys[i] = getKey(items[i]);
-        return Keys.ofSorted(keys);
+        // TODO (review): This doesn't pick the "right" domain which could 
make Accord angry (in practice it doesn't)
+        // but I think the right track going forward is to be selectively 
forgiving of empty Seekables in the wrong
+        // domain in Accord for `Update.key()` and `Read.keys` to save 
implementations from having to propagate them
+        // or alternatively just allow null `Read` just like we allow null 
`Query` and null `Update`
+        if (items.length == 0)
+            return Keys.of();

Review Comment:
   isn't this only an issue if we had a code path that could do `Write Range`?  
Since we are `Read Range` only then `Key` for empty read feels like a fine 
thing?  This basically gives us `TxnRead.EMPTY`



##########
src/java/org/apache/cassandra/service/accord/AccordMessageSink.java:
##########
@@ -250,6 +250,8 @@ public void send(Node.Id to, Request request)
         Message<Request> message = Message.out(verb, request);
         InetAddressAndPort endpoint = endpointMapper.mappedEndpoint(to);
         logger.trace("Sending {} {} to {}", verb, message.payload, endpoint);
+        if (AccordService.DEBUG_LOG_MESSAGING)

Review Comment:
   can we remove this logic?  we can awlays do `nodetool setlogginglevel 
org.apache.cassandra.service.accord.AccordMessageSink TRACE` or similar... if 
this is for manual debugging you can also change the logger xml...



##########
src/java/org/apache/cassandra/service/accord/AccordService.java:
##########
@@ -195,6 +195,11 @@ public class AccordService implements IAccordService, 
Shutdownable
 {
     private static final Logger logger = 
LoggerFactory.getLogger(AccordService.class);
 
+    // Single flag to enable debug logging in AccordMessageSink, 
AccordVerbHandler
+    // These are always logged at trace, but it's easier to just flip this 
flag and see
+    // output in tests
+    public static final boolean DEBUG_LOG_MESSAGING = true;

Review Comment:
   can we remove this logic?  we can always do `nodetool setlogginglevel 
org.apache.cassandra.service.accord.AccordMessageSink TRACE` or similar... if 
this is for manual debugging you can also change the logger xml...



##########
src/java/org/apache/cassandra/service/accord/AccordTopology.java:
##########
@@ -280,7 +294,9 @@ public static Topology createAccordTopology(Epoch epoch, 
DistributedSchema schem
             List<TableMetadata> tables = 
keyspace.tables.stream().filter(TableMetadata::requiresAccordSupport).collect(Collectors.toList());
             if (tables.isEmpty())
                 continue;
+            System.out.println("In epoch " + epoch + " Accord tables are " + 
tables);
             List<KeyspaceShard> ksShards = KeyspaceShard.forKeyspace(keyspace, 
placements, directory);
+            System.out.println("Shards are " + shards);

Review Comment:
   please remove



##########
src/java/org/apache/cassandra/service/accord/interop/AccordInteropExecution.java:
##########
@@ -245,113 +266,113 @@ public void sendReadCommand(Message<ReadCommand> 
message, InetAddressAndPort to,
     public void sendReadRepairMutation(Message<Mutation> message, 
InetAddressAndPort to, RequestCallback<Object> callback)
     {
         checkArgument(message.payload.allowsPotentialTransactionConflicts());
+        checkArgument(message.payload.getTableIds().size() == 1);

Review Comment:
   what about the case where we do `BATCH` and touch 2 tables with the same 
token?
   
   ```
   BEGIN BATCH
     insert into tbl1 ... (pk=0, ...);
     insert into tbl2 ... (pk=0, ...);
   APPLY BATCH;
   ```



##########
src/java/org/apache/cassandra/service/accord/txn/TxnNamedRead.java:
##########
@@ -59,34 +84,88 @@ public class TxnNamedRead extends 
AbstractSerialized<ReadCommand>
     @SuppressWarnings("unused")
     private static final Logger logger = 
LoggerFactory.getLogger(TxnNamedRead.class);
 
-    private static final long EMPTY_SIZE = ObjectSizes.measure(new 
TxnNamedRead(0, null, null));
+    private static final long EMPTY_SIZE = ObjectSizes.measure(new 
TxnNamedRead(0, Keys.of(), (ByteBuffer) null));
 
     private final int name;
-    private final PartitionKey key;
+    private final Seekables<?, ?> keys;
 
     public TxnNamedRead(int name, @Nullable SinglePartitionReadCommand value)
     {
         super(value);
         this.name = name;
-        this.key = new PartitionKey(value.metadata().id, value.partitionKey());
+        this.keys = Keys.of(new PartitionKey(value.metadata().id, 
value.partitionKey()));
     }
 
-    public TxnNamedRead(int name, PartitionKey key, ByteBuffer bytes)
+    public static TokenRange 
boundsAsAccordRange(AbstractBounds<PartitionPosition> range, TableId tableId)
+    {
+        // Should already have been unwrapped
+        checkState(!AbstractBounds.strictlyWrapsAround(range.left, 
range.right));
+
+        // Read commands can contain a mix of different kinds of bounds to 
facilitate paging
+        // and we need to communicate that to Accord as its own ranges. This 
uses
+        // TokenKey, SentinelKey, and MinTokenKey and sticks exclusively with 
left exclusive/right inclusive
+        // ranges rather add more types of ranges to the mix
+        // MinTokenKey allows emulating inclusive left and exclusive right 
with Range
+        boolean inclusiveLeft = range.inclusiveLeft();
+        PartitionPosition startPP = range.left;
+        boolean startIsMinKeyBound = startPP.getClass() == KeyBound.class ? 
((KeyBound)startPP).isMinimumBound : false;
+        Token startToken = startPP.getToken();
+        AccordRoutingKey startAccordRoutingKey;
+        if (startToken.isMinimum() && inclusiveLeft)
+            startAccordRoutingKey = SentinelKey.min(tableId);
+        else if (inclusiveLeft || startIsMinKeyBound)
+            startAccordRoutingKey = new MinTokenKey(tableId, startToken);
+        else
+            startAccordRoutingKey = new TokenKey(tableId, startToken);
+
+        boolean inclusiveRight = range.inclusiveRight();
+        PartitionPosition endPP = range.right;
+        boolean endIsMinKeyBound = endPP.getClass() == KeyBound.class ? 
((KeyBound)endPP).isMinimumBound : false;
+        Token stopToken = range.right.getToken();
+        AccordRoutingKey stopAccordRoutingKey;
+        if (stopToken.isMinimum())
+            stopAccordRoutingKey = SentinelKey.max(tableId);
+        else if (inclusiveRight && !endIsMinKeyBound)
+            stopAccordRoutingKey = new TokenKey(tableId, stopToken);
+        else
+            stopAccordRoutingKey = new MinTokenKey(tableId, stopToken);
+        return TokenRange.create(startAccordRoutingKey, stopAccordRoutingKey);
+    }
+
+    public TxnNamedRead(int name, List<AbstractBounds<PartitionPosition>> 
ranges, PartitionRangeReadCommand value)

Review Comment:
   why is this a list?  that adds a lot of complexity to this and related 
logic, and makes me question the correctness of many assumptions as the types 
allow multiple values... all that boils down to this single constructor... 
which is only ever reached from 
`org.apache.cassandra.service.reads.range.RangeCommandIterator#executeAccord`
   
   ```
   StorageProxy.readWithAccord(cm, rangeCommand, 
ImmutableList.of(rangeCommand.dataRange().keyRange()), cl, requestTime);
   ```
   
   So in every code path we are a single value... so why add the complexity of 
`List`?  



##########
src/java/org/apache/cassandra/service/accord/AccordService.java:
##########
@@ -768,7 +775,17 @@ public TopologyManager topology()
     public @Nonnull TxnResult coordinate(long minEpoch, @Nonnull Txn txn, 
@Nonnull ConsistencyLevel consistencyLevel, @Nonnull Dispatcher.RequestTime 
requestTime)
     {
         AsyncTxnResult asyncTxnResult = coordinateAsync(minEpoch, txn, 
consistencyLevel, requestTime);
-        return getTxnResult(asyncTxnResult, txn.isWrite(), consistencyLevel, 
requestTime);
+        try
+        {
+            return getTxnResult(asyncTxnResult);
+        }
+        catch (TopologyMismatch e)
+        {
+            // For now assuming topology mismatch is caused by a race 
misrouting

Review Comment:
   > This was particularly painful in that the Accord txn didn't get to the 
point where it runs the code that would return the retry different system error
   
   then the correct thing to do is what you did in a previous patch 
CASSANDRA-20089, detect that case and await on the epoch... we know what epoch 
last modified a table and if accord doesn't know it yet, we can await that and 
avoid the costly retry logic and special handling of this exception... that 
also gets rid of bugs due to retrying when we should not...
   
   > If you have created the transaction then the coordinator has already 
decided the table exists at least once and will loop again and get the cluster 
metadata and then find the table doesn't exist attempting to route it because 
getTableMetadata will throw IRE.
   
   you can get past that validation with the following, due to race 
conditions...
   
   ```
   insert into tbl ... IF NOT EXISTS;
   drop table tbl;
   ```
   
   this pattern has really bad consequences with this patch as it keeps trying 
to handle this exception... when the correct answer to solve the problem you 
are trying to solve is to wait for accord to know about the epoch...
   
   
   See
   
   ```
   commit c4fef9cdd0b8cc7034f18d45a7bde349ed723b17
   Author: Ariel Weisberg <[email protected]>
   Date:   Thu Nov 14 14:00:30 2024 -0500
   
       Fix Accord SAI tests and Accord double apply
   
       Patch by Ariel Weisberg and David Capwell; Reviewed by David Capwell for 
CASSANDRA-20089
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Post 19437 live migration for reads and fixes [cassandra]

Reply via email to