dcapwell commented on code in PR #3777:
URL: https://github.com/apache/cassandra/pull/3777#discussion_r1915695731
##########
src/java/org/apache/cassandra/service/accord/AccordTopology.java:
##########
@@ -280,7 +294,9 @@ public static Topology createAccordTopology(Epoch epoch,
DistributedSchema schem
List<TableMetadata> tables =
keyspace.tables.stream().filter(TableMetadata::requiresAccordSupport).collect(Collectors.toList());
if (tables.isEmpty())
continue;
+ System.out.println("In epoch " + epoch + " Accord tables are " +
tables);
Review Comment:
please remove
##########
src/java/org/apache/cassandra/service/accord/txn/AbstractKeySorted.java:
##########
@@ -51,17 +58,63 @@ public AbstractKeySorted(List<T> items)
{
T[] arr = newArray(items.size());
items.toArray(arr);
- Arrays.sort(arr, this::compare);
this.items = arr;
+ if (items.size() == 0)
+ {
+ this.itemKeys = Keys.of();
+ return;
+ }
+ Domain domain = getKeys(arr[0]).domain();
+ switch (domain)
+ {
+ case Key:
+ Arrays.sort(arr, this::compareKey);
+ break;
+ case Range:
+ Arrays.sort(arr, this::compareRange);
+ break;
+ default:
+ throw new IllegalStateException("Unhandled domain " + domain);
+ }
this.itemKeys = extractItemKeys();
}
- private Keys extractItemKeys()
+ private Seekables extractItemKeys()
{
- PartitionKey[] keys = new PartitionKey[items.length];
- for (int i = 0 ; i < keys.length ; ++i)
- keys[i] = getKey(items[i]);
- return Keys.ofSorted(keys);
+ // TODO (review): This doesn't pick the "right" domain which could
make Accord angry (in practice it doesn't)
+ // but I think the right track going forward is to be selectively
forgiving of empty Seekables in the wrong
+ // domain in Accord for `Update.key()` and `Read.keys` to save
implementations from having to propagate them
+ // or alternatively just allow null `Read` just like we allow null
`Query` and null `Update`
+ if (items.length == 0)
+ return Keys.of();
Review Comment:
isn't this only an issue if we had a code path that could do `Write Range`?
Since we are `Read Range` only then `Key` for empty read feels like a fine
thing? This basically gives us `TxnRead.EMPTY`
##########
src/java/org/apache/cassandra/service/accord/AccordMessageSink.java:
##########
@@ -250,6 +250,8 @@ public void send(Node.Id to, Request request)
Message<Request> message = Message.out(verb, request);
InetAddressAndPort endpoint = endpointMapper.mappedEndpoint(to);
logger.trace("Sending {} {} to {}", verb, message.payload, endpoint);
+ if (AccordService.DEBUG_LOG_MESSAGING)
Review Comment:
can we remove this logic? we can awlays do `nodetool setlogginglevel
org.apache.cassandra.service.accord.AccordMessageSink TRACE` or similar... if
this is for manual debugging you can also change the logger xml...
##########
src/java/org/apache/cassandra/service/accord/AccordService.java:
##########
@@ -195,6 +195,11 @@ public class AccordService implements IAccordService,
Shutdownable
{
private static final Logger logger =
LoggerFactory.getLogger(AccordService.class);
+ // Single flag to enable debug logging in AccordMessageSink,
AccordVerbHandler
+ // These are always logged at trace, but it's easier to just flip this
flag and see
+ // output in tests
+ public static final boolean DEBUG_LOG_MESSAGING = true;
Review Comment:
can we remove this logic? we can always do `nodetool setlogginglevel
org.apache.cassandra.service.accord.AccordMessageSink TRACE` or similar... if
this is for manual debugging you can also change the logger xml...
##########
src/java/org/apache/cassandra/service/accord/AccordTopology.java:
##########
@@ -280,7 +294,9 @@ public static Topology createAccordTopology(Epoch epoch,
DistributedSchema schem
List<TableMetadata> tables =
keyspace.tables.stream().filter(TableMetadata::requiresAccordSupport).collect(Collectors.toList());
if (tables.isEmpty())
continue;
+ System.out.println("In epoch " + epoch + " Accord tables are " +
tables);
List<KeyspaceShard> ksShards = KeyspaceShard.forKeyspace(keyspace,
placements, directory);
+ System.out.println("Shards are " + shards);
Review Comment:
please remove
##########
src/java/org/apache/cassandra/service/accord/interop/AccordInteropExecution.java:
##########
@@ -245,113 +266,113 @@ public void sendReadCommand(Message<ReadCommand>
message, InetAddressAndPort to,
public void sendReadRepairMutation(Message<Mutation> message,
InetAddressAndPort to, RequestCallback<Object> callback)
{
checkArgument(message.payload.allowsPotentialTransactionConflicts());
+ checkArgument(message.payload.getTableIds().size() == 1);
Review Comment:
what about the case where we do `BATCH` and touch 2 tables with the same
token?
```
BEGIN BATCH
insert into tbl1 ... (pk=0, ...);
insert into tbl2 ... (pk=0, ...);
APPLY BATCH;
```
##########
src/java/org/apache/cassandra/service/accord/txn/TxnNamedRead.java:
##########
@@ -59,34 +84,88 @@ public class TxnNamedRead extends
AbstractSerialized<ReadCommand>
@SuppressWarnings("unused")
private static final Logger logger =
LoggerFactory.getLogger(TxnNamedRead.class);
- private static final long EMPTY_SIZE = ObjectSizes.measure(new
TxnNamedRead(0, null, null));
+ private static final long EMPTY_SIZE = ObjectSizes.measure(new
TxnNamedRead(0, Keys.of(), (ByteBuffer) null));
private final int name;
- private final PartitionKey key;
+ private final Seekables<?, ?> keys;
public TxnNamedRead(int name, @Nullable SinglePartitionReadCommand value)
{
super(value);
this.name = name;
- this.key = new PartitionKey(value.metadata().id, value.partitionKey());
+ this.keys = Keys.of(new PartitionKey(value.metadata().id,
value.partitionKey()));
}
- public TxnNamedRead(int name, PartitionKey key, ByteBuffer bytes)
+ public static TokenRange
boundsAsAccordRange(AbstractBounds<PartitionPosition> range, TableId tableId)
+ {
+ // Should already have been unwrapped
+ checkState(!AbstractBounds.strictlyWrapsAround(range.left,
range.right));
+
+ // Read commands can contain a mix of different kinds of bounds to
facilitate paging
+ // and we need to communicate that to Accord as its own ranges. This
uses
+ // TokenKey, SentinelKey, and MinTokenKey and sticks exclusively with
left exclusive/right inclusive
+ // ranges rather add more types of ranges to the mix
+ // MinTokenKey allows emulating inclusive left and exclusive right
with Range
+ boolean inclusiveLeft = range.inclusiveLeft();
+ PartitionPosition startPP = range.left;
+ boolean startIsMinKeyBound = startPP.getClass() == KeyBound.class ?
((KeyBound)startPP).isMinimumBound : false;
+ Token startToken = startPP.getToken();
+ AccordRoutingKey startAccordRoutingKey;
+ if (startToken.isMinimum() && inclusiveLeft)
+ startAccordRoutingKey = SentinelKey.min(tableId);
+ else if (inclusiveLeft || startIsMinKeyBound)
+ startAccordRoutingKey = new MinTokenKey(tableId, startToken);
+ else
+ startAccordRoutingKey = new TokenKey(tableId, startToken);
+
+ boolean inclusiveRight = range.inclusiveRight();
+ PartitionPosition endPP = range.right;
+ boolean endIsMinKeyBound = endPP.getClass() == KeyBound.class ?
((KeyBound)endPP).isMinimumBound : false;
+ Token stopToken = range.right.getToken();
+ AccordRoutingKey stopAccordRoutingKey;
+ if (stopToken.isMinimum())
+ stopAccordRoutingKey = SentinelKey.max(tableId);
+ else if (inclusiveRight && !endIsMinKeyBound)
+ stopAccordRoutingKey = new TokenKey(tableId, stopToken);
+ else
+ stopAccordRoutingKey = new MinTokenKey(tableId, stopToken);
+ return TokenRange.create(startAccordRoutingKey, stopAccordRoutingKey);
+ }
+
+ public TxnNamedRead(int name, List<AbstractBounds<PartitionPosition>>
ranges, PartitionRangeReadCommand value)
Review Comment:
why is this a list? that adds a lot of complexity to this and related
logic, and makes me question the correctness of many assumptions as the types
allow multiple values... all that boils down to this single constructor...
which is only ever reached from
`org.apache.cassandra.service.reads.range.RangeCommandIterator#executeAccord`
```
StorageProxy.readWithAccord(cm, rangeCommand,
ImmutableList.of(rangeCommand.dataRange().keyRange()), cl, requestTime);
```
So in every code path we are a single value... so why add the complexity of
`List`?
##########
src/java/org/apache/cassandra/service/accord/AccordService.java:
##########
@@ -768,7 +775,17 @@ public TopologyManager topology()
public @Nonnull TxnResult coordinate(long minEpoch, @Nonnull Txn txn,
@Nonnull ConsistencyLevel consistencyLevel, @Nonnull Dispatcher.RequestTime
requestTime)
{
AsyncTxnResult asyncTxnResult = coordinateAsync(minEpoch, txn,
consistencyLevel, requestTime);
- return getTxnResult(asyncTxnResult, txn.isWrite(), consistencyLevel,
requestTime);
+ try
+ {
+ return getTxnResult(asyncTxnResult);
+ }
+ catch (TopologyMismatch e)
+ {
+ // For now assuming topology mismatch is caused by a race
misrouting
Review Comment:
> This was particularly painful in that the Accord txn didn't get to the
point where it runs the code that would return the retry different system error
then the correct thing to do is what you did in a previous patch
CASSANDRA-20089, detect that case and await on the epoch... we know what epoch
last modified a table and if accord doesn't know it yet, we can await that and
avoid the costly retry logic and special handling of this exception... that
also gets rid of bugs due to retrying when we should not...
> If you have created the transaction then the coordinator has already
decided the table exists at least once and will loop again and get the cluster
metadata and then find the table doesn't exist attempting to route it because
getTableMetadata will throw IRE.
you can get past that validation with the following, due to race
conditions...
```
insert into tbl ... IF NOT EXISTS;
drop table tbl;
```
this pattern has really bad consequences with this patch as it keeps trying
to handle this exception... when the correct answer to solve the problem you
are trying to solve is to wait for accord to know about the epoch...
See
```
commit c4fef9cdd0b8cc7034f18d45a7bde349ed723b17
Author: Ariel Weisberg <[email protected]>
Date: Thu Nov 14 14:00:30 2024 -0500
Fix Accord SAI tests and Accord double apply
Patch by Ariel Weisberg and David Capwell; Reviewed by David Capwell for
CASSANDRA-20089
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]