[ 
https://issues.apache.org/jira/browse/CASSANDRA-20955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18070395#comment-18070395
 ] 

Ariel Weisberg edited comment on CASSANDRA-20955 at 4/1/26 5:43 PM:
--------------------------------------------------------------------

  Stale shards cause {{IllegalStateException}} during round-trip migration, 
breaking schema agreement in tests

  When a keyspace goes tracked → untracked → tracked, the second transition 
back to tracked throws:

  {noformat}
  java.lang.IllegalStateException: Existing shard found for keyspace, but prev ksn has mutation tracking disabled
      at MutationTrackingService$KeyspaceShards$UpdateDecision.decisionForTopologyChange(MutationTrackingService.java:1096)
      at MutationTrackingService.onNewClusterMetadata(MutationTrackingService.java:849)
      at MutationTrackingService$1.notifyPostCommit(MutationTrackingService.java:206)
      at LocalLog.processPendingInternal(LocalLog.java:548)
  {noformat}

  The problem is that shard cleanup isn't implemented yet. When a keyspace 
migrates from tracked to untracked, {{MIGRATE_FROM}} falls through to {{NONE}} 
(lines 924-928), so the old shards stick around. The TODO on line 925 already 
acknowledges this. After the migration completes the shards are still there, 
and when we later flip back to tracked, {{decisionForTopologyChange()}} hits 
the {{!prev.useMutationTracking() && next.useMutationTracking()}} branch, 
where the precondition {{checkState(!hasExisting)}} fails.
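
  The round trip can be reproduced in a toy model. This is only a sketch 
under stated assumptions: {{StaleShardSketch}}, {{onTransition()}} and 
{{hasShard()}} are hypothetical names, not the real 
{{MutationTrackingService}} API, and a plain {{Set}} stands in for the shard 
registry. It shows why skipping cleanup on tracked → untracked makes the later 
untracked → tracked precondition fail.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model only: names here are hypothetical, not Cassandra's actual classes.
public class StaleShardSketch {
    // Stand-in for the per-keyspace shard registry.
    private final Set<String> shards = new HashSet<>();

    /** Applies one tracking transition for a keyspace. */
    public void onTransition(String keyspace, boolean prevTracked, boolean nextTracked) {
        if (!prevTracked && nextTracked) {
            // Precondition described in the comment: an untracked keyspace
            // is expected to have no existing shard.
            if (shards.contains(keyspace))
                throw new IllegalStateException(
                    "Existing shard found for " + keyspace + ", but prev had tracking disabled");
            shards.add(keyspace);
        }
        // prevTracked && !nextTracked: models MIGRATE_FROM falling through to
        // NONE, so nothing is removed and the shard goes stale.
    }

    public boolean hasShard(String keyspace) {
        return shards.contains(keyspace);
    }

    public static void main(String[] args) {
        StaleShardSketch service = new StaleShardSketch();
        service.onTransition("ks", false, true);  // untracked -> tracked: shard created
        service.onTransition("ks", true, false);  // tracked -> untracked: shard left behind
        try {
            service.onTransition("ks", false, true);  // back to tracked: precondition fails
        } catch (IllegalStateException e) {
            System.out.println("round trip failed: " + e.getMessage());
        }
    }
}
```

  In this model, removing the shard in the tracked → untracked case would let 
the precondition hold again on the way back.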

  The exception is thrown inside 
{{MutationTrackingService$1.notifyPostCommit()}}, which runs as a 
{{ChangeListener}} in {{LocalLog.processPendingInternal()}}. That aborts the 
rest of the listener chain for that epoch on that node, so 
{{SchemaListener.notifyPostCommit()}} never runs, 
{{SchemaDiagnostics.versionUpdated()}} never fires, and the dtest framework's 
{{SchemaChangeMonitor}} never gets the callback. The result is a 120-second 
hang waiting for schema agreement.
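
  The listener-chain effect can be illustrated with a small sketch. It 
assumes listeners are invoked one after another with no per-listener exception 
handling, which is what the behaviour described above implies; 
{{ListenerChainSketch}} and its methods are hypothetical names, not the 
{{LocalLog}} implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: not the real LocalLog / ChangeListener code.
public class ListenerChainSketch {
    private final List<Runnable> listeners = new ArrayList<>();

    public void add(Runnable listener) {
        listeners.add(listener);
    }

    /** Invokes listeners in registration order; an exception stops the loop. */
    public void notifyPostCommit() {
        for (Runnable listener : listeners)
            listener.run();  // no try/catch: a throw skips every later listener
    }

    public static void main(String[] args) {
        ListenerChainSketch chain = new ListenerChainSketch();
        boolean[] schemaNotified = {false};

        // Stand-in for the mutation tracking listener hitting the failed precondition.
        chain.add(() -> { throw new IllegalStateException("stale shard"); });
        // Stand-in for the schema listener that the test framework waits on.
        chain.add(() -> schemaNotified[0] = true);

        try {
            chain.notifyPostCommit();
        } catch (IllegalStateException e) {
            System.out.println("chain aborted: " + e.getMessage());
        }
        System.out.println("schema listener ran: " + schemaNotified[0]);
    }
}
```

  Because the second listener never runs, anything waiting on its callback 
can only time out.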

  Both affected checks in {{decisionForTopologyChange()}} are marked with 
{{TODO (CASSANDRA-20955)}}. Once shard cleanup is properly implemented, these 
should go back to being precondition checks.



was (Author: aweisberg):
  Stale shards cause {{IllegalStateException}} during round-trip migration, 
breaking schema agreement in tests

  When a keyspace goes tracked → untracked → tracked, the second transition 
back to tracked throws:

  {noformat}
  java.lang.IllegalStateException: Existing shard found for keyspace, but prev ksn has mutation tracking disabled
      at MutationTrackingService$KeyspaceShards$UpdateDecision.decisionForTopologyChange(MutationTrackingService.java:1096)
      at MutationTrackingService.onNewClusterMetadata(MutationTrackingService.java:849)
      at MutationTrackingService$1.notifyPostCommit(MutationTrackingService.java:206)
      at LocalLog.processPendingInternal(LocalLog.java:548)
  {noformat}

  The problem is that shard cleanup isn't implemented yet. When a keyspace 
migrates from tracked to untracked, {{MIGRATE_FROM}} falls through to {{NONE}} 
(lines 924-928), so the old shards stick around. The TODO on line 925 already 
acknowledges this. After the migration completes the shards are still there, 
and when we later flip back to tracked, {{decisionForTopologyChange()}} hits 
the {{!prev.useMutationTracking() && next.useMutationTracking()}} branch, 
where the precondition {{checkState(!hasExisting)}} fails.

  This also has a nasty secondary effect: the exception fires inside 
{{MutationTrackingService$1.notifyPostCommit()}}, which runs as a 
{{ChangeListener}} in {{LocalLog.processPendingInternal()}}. That aborts the 
rest of the listener chain for that epoch on that node, so 
{{SchemaListener.notifyPostCommit()}} never runs, 
{{SchemaDiagnostics.versionUpdated()}} never fires, and the dtest framework's 
{{SchemaChangeMonitor}} never gets the callback. The result is a 120-second 
hang waiting for schema agreement.

  Worked around this in CASSANDRA-21098 by returning {{REPLICA_GROUP}} instead 
of throwing when {{hasExisting}} is true, in two places in 
{{decisionForTopologyChange()}}:
  - Lines 1076-1078: {{prevKsm == null}} case (already had this pattern)
  - Lines 1095-1102: untracked→tracked case (new)

  Both are marked with {{TODO (CASSANDRA-20955)}}. Once shard cleanup is 
properly implemented these should go back to being precondition checks.


> CEP-45: Add support for dropping tables & keyspaces
> ---------------------------------------------------
>
>                 Key: CASSANDRA-20955
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-20955
>             Project: Apache Cassandra
>          Issue Type: Improvement
>          Components: Consistency/Coordination
>            Reporter: Blake Eggleston
>            Priority: Normal
>



