[jira] [Commented] (CASSANDRA-19344) Range movements involving transient replicas must safely enact changes to read and write replica sets

Sam Tunnicliffe (Jira) Thu, 11 Apr 2024 10:57:04 -0700


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-19344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17836302#comment-17836302
 ]


Sam Tunnicliffe commented on CASSANDRA-19344:
---------------------------------------------

The actual cause was that the way we construct placement deltas for a 
PlacementTransitionPlan did not properly account for whether a replica is 
transient or full. Multi-step operations always follow the pattern:

* add new write replicas
* add new read replicas/remove old read replicas
* remove old write replicas

So when an operation causes a replica to transition from TRANSIENT to FULL for 
the same range (or part of a range), it could become a FULL read replica before 
becoming a FULL write replica.

Consider this simplified example, in which we remove N4, and its effect on N2:

{code}
RF=3/1
At START 
          10        20        30        40         
+---------+---------+---------+---------+---------+
          N1        N2        N3        N4

N2 replicates:
  (10,20]       - FULL (Primary Range)
  (,10] + (40,] - FULL
  (30,40]       - TRANSIENT

After FINISH

          10        20        30        
+---------+---------+---------+-------------------+
          N1        N2        N3        

N2 replicates:
  (10,20]       - FULL (Primary Range)
  (,10] + (30,] - FULL
  (20,30]       - TRANSIENT

In removing N4, N2 gains (20,30] TRANSIENT and (30,40] TRANSIENT -> FULL
Potential problem ->
    for READS N2 becomes FULL(30,40] after MID_LEAVE
    for WRITES N2 only becomes FULL(30,40] after FINISH_LEAVE
    so between the 2 events, coordinators will not send writes to N2 unless
    one of the other replicas is unresponsive.
    Coordinators will send reads to N2 during this window though.
    If cleanup is run before N2 becomes a FULL replica for (30,40], any data
    for that range (including that which was just streamed to it) will be
    purged.
{code}
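
The START and FINISH placements in the example can be reproduced with a small model. This is only a sketch under simplifying assumptions (one token per node, a single datacenter, replicas chosen by walking the ring clockwise, and the last replica of each range being the transient one); the function name {{placements}} is hypothetical and is not Cassandra code. A wrapping range such as (,10] + (40,] is written here as the tuple (40, 10).

```python
from typing import Dict, Tuple

Range = Tuple[int, int]  # (start, end]; start > end means the range wraps


def placements(tokens: Dict[str, int], rf: int = 3,
               transient: int = 1) -> Dict[str, Dict[Range, str]]:
    """Map each node to the ranges it replicates and its status for each.

    Simplified model: the replicas of a range are its primary owner plus
    the next rf-1 nodes clockwise; the last `transient` of them are
    TRANSIENT, the rest are FULL.
    """
    ring = sorted(tokens, key=tokens.get)  # node names in token order
    result: Dict[str, Dict[Range, str]] = {node: {} for node in ring}
    for i, primary in enumerate(ring):
        # (previous node's token, own token]; ring[-1] handles the wrap
        rng = (tokens[ring[i - 1]], tokens[primary])
        for offset in range(min(rf, len(ring))):
            node = ring[(i + offset) % len(ring)]
            kind = "FULL" if offset < rf - transient else "TRANSIENT"
            result[node][rng] = kind
    return result


before = placements({"N1": 10, "N2": 20, "N3": 30, "N4": 40})
after = placements({"N1": 10, "N2": 20, "N3": 30})
print("START: ", before["N2"])
print("FINISH:", after["N2"])
```

Comparing the two results for N2 shows (30,40] moving from TRANSIENT to FULL (as part of the wrapping range that now starts at 30) and (20,30] being gained as TRANSIENT.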

Below is an illustration of the ranges replicated by N2 at each step:

{code}
+-----+------------------------+--------------------------------------------------------------------------------+
|EPOCH| STATE                  | RANGES REPLICATED BY N2                                                        |
|-----+------------------------+--------------------------------------------------------------------------------|
|0    | START STATE            | WRITES -> FULL: [(40,], (,10], (10,20]]          TRANSIENT: [(30,40]]          |
|     |                        |  READS -> FULL: [(40,], (,10], (10,20]]          TRANSIENT: [(30,40]]          |
|-----+------------------------+--------------------------------------------------------------------------------|
|1    | ENACT START_LEAVE(N4)  | WRITES -> FULL: [(40,], (,10], (10,20]]          TRANSIENT: [(20,30], (30,40]] |
|     |                        |  READS -> FULL: [(40,], (,10], (10,20]]          TRANSIENT: [(30,40]]          |
|-----+------------------------+--------------------------------------------------------------------------------|
|2    | ENACT MID_LEAVE(N4)    | WRITES -> FULL: [(40,], (,10], (10,20]]          TRANSIENT: [(20,30], (30,40]] |
|     |                        |  READS -> FULL: [(40,], (,10], (10,20], (30,40]] TRANSIENT: [(20,30]]          |
|-----+------------------------+--------------------------------------------------------------------------------|
|3    | ENACT FINISH_LEAVE(N4) | WRITES -> FULL: [(30,], (,10], (10,20]]          TRANSIENT: [(20,30]]          |
|     |                        |  READS -> FULL: [(30,], (,10], (10,20]]          TRANSIENT: [(20,30]]          |
+-----+------------------------+--------------------------------------------------------------------------------+
{code}
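
The problematic window can be picked out mechanically. The following sketch transcribes N2's status for (30,40] from the table above and lists the epochs in which it is FULL for reads but not yet FULL for writes. The tuple layout and the name {{unsafe_epochs}} are assumptions made for illustration only.

```python
# N2's status for range (30,40] at each epoch, transcribed from the table.
# Each entry is (epoch, status for writes, status for reads).
TIMELINE = [
    (0, "TRANSIENT", "TRANSIENT"),  # START STATE
    (1, "TRANSIENT", "TRANSIENT"),  # ENACT START_LEAVE(N4)
    (2, "TRANSIENT", "FULL"),       # ENACT MID_LEAVE(N4)
    (3, "FULL", "FULL"),            # ENACT FINISH_LEAVE(N4)
]


def unsafe_epochs(timeline):
    """Return epochs where a replica is FULL for reads but not for writes."""
    return [epoch for epoch, writes, reads in timeline
            if reads == "FULL" and writes != "FULL"]


print(unsafe_epochs(TIMELINE))  # [2]
```

Epoch 2 is the window between MID_LEAVE and FINISH_LEAVE described in the example.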

With the fix applied, N2's transition from {{TRANSIENT}} to {{FULL}} for 
writes of {{(30,40]}} is enacted as part of {{START_LEAVE(N4)}} in epoch 1, 
i.e. before N2 becomes a {{FULL}} replica for reads of {{(30,40]}} when 
{{MID_LEAVE(N4)}} is enacted in epoch 2.

{code}
+-----+------------------------+--------------------------------------------------------------------------------+
|EPOCH| STATE                  | RANGES REPLICATED BY N2                                                        |
|-----+------------------------+--------------------------------------------------------------------------------|
|0    | START STATE            | WRITES -> FULL: [(40,], (,10], (10,20]]          TRANSIENT: [(30,40]]          |
|     |                        |  READS -> FULL: [(40,], (,10], (10,20]]          TRANSIENT: [(30,40]]          |
|-----+------------------------+--------------------------------------------------------------------------------|
|1    | ENACT START_LEAVE(N4)  | WRITES -> FULL: [(40,], (,10], (10,20], (30,40]] TRANSIENT: [(20,30]]          |
|     |                        |  READS -> FULL: [(40,], (,10], (10,20]]          TRANSIENT: [(30,40]]          |
|-----+------------------------+--------------------------------------------------------------------------------|
|2    | ENACT MID_LEAVE(N4)    | WRITES -> FULL: [(40,], (,10], (10,20], (30,40]] TRANSIENT: [(20,30]]          |
|     |                        |  READS -> FULL: [(40,], (,10], (10,20], (30,40]] TRANSIENT: [(20,30]]          |
|-----+------------------------+--------------------------------------------------------------------------------|
|3    | ENACT FINISH_LEAVE(N4) | WRITES -> FULL: [(30,], (,10], (10,20]]          TRANSIENT: [(20,30]]          |
|     |                        |  READS -> FULL: [(30,], (,10], (10,20]]          TRANSIENT: [(20,30]]          |
+-----+------------------------+--------------------------------------------------------------------------------+
{code}
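
The effect of the reordering can be sketched as follows. This is not the actual patch: the delta shape, the step names used as dictionary keys, and the helpers {{hoist_write_upgrades}} and {{replay}} are all hypothetical, and only N2's status for (30,40] is modelled. The idea illustrated is that any write-side upgrade to FULL is moved into the first step, so it is enacted before the corresponding read-side upgrade.

```python
from typing import Dict, List, Tuple

# One delta: (replica set it applies to, range, new status). Hypothetical shape.
Delta = Tuple[str, str, str]

STEPS = ["START_LEAVE", "MID_LEAVE", "FINISH_LEAVE"]

# Pre-fix deltas affecting N2's status for (30,40] while removing N4.
PLAN: Dict[str, List[Delta]] = {
    "START_LEAVE": [],
    "MID_LEAVE": [("reads", "(30,40]", "FULL")],
    "FINISH_LEAVE": [("writes", "(30,40]", "FULL")],
}


def hoist_write_upgrades(plan: Dict[str, List[Delta]]) -> Dict[str, List[Delta]]:
    """Return a copy of the plan with write upgrades to FULL moved to step 1."""
    fixed = {step: list(deltas) for step, deltas in plan.items()}
    for step in STEPS[1:]:
        upgrades = [d for d in fixed[step]
                    if d[0] == "writes" and d[2] == "FULL"]
        fixed[step] = [d for d in fixed[step] if d not in upgrades]
        fixed[STEPS[0]].extend(upgrades)
    return fixed


def replay(plan: Dict[str, List[Delta]], rng: str = "(30,40]"):
    """Apply the steps in order; return (step, writes, reads) after each."""
    state = {"writes": "TRANSIENT", "reads": "TRANSIENT"}
    history = []
    for step in STEPS:
        for side, r, status in plan[step]:
            if r == rng:
                state[side] = status
        history.append((step, state["writes"], state["reads"]))
    return history


for row in replay(hoist_write_upgrades(PLAN)):
    print(row)
```

Replaying the adjusted plan gives FULL for writes from START_LEAVE onwards and FULL for reads only from MID_LEAVE, which matches the (30,40] entries in the second table.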

For more detail, I've also attached files showing the specific placement 
deltas applied at each of these steps.


> Range movements involving transient replicas must safely enact changes to 
> read and write replica sets
> -----------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-19344
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19344
>             Project: Cassandra
>          Issue Type: Bug
>          Components: CI
>            Reporter: Ekaterina Dimitrova
>            Assignee: Sam Tunnicliffe
>            Priority: Normal
>             Fix For: 5.x
>
>         Attachments: ci_summary.html, remove-n4-post-19344.txt, 
> remove-n4-pre-19344.txt, result_details.tar.gz
>
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> (edit) This was originally opened due to a flaky test 
> {{org.apache.cassandra.distributed.test.TransientRangeMovementTest.testRemoveNode-_jdk17}}
> The test can fail in two different ways:
> {code:java}
> junit.framework.AssertionFailedError: NOT IN CURRENT: 31 -- [(00,20), (31,50)]
>  at org.apache.cassandra.distributed.test.TransientRangeMovementTest.assertAllContained(TransientRangeMovementTest.java:203)
>  at org.apache.cassandra.distributed.test.TransientRangeMovementTest.testRemoveNode(TransientRangeMovementTest.java:183)
>  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
>  at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> {code}
> as in here - 
> [https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/2639/workflows/32b92ce7-5e9d-4efb-8362-d200d2414597/jobs/55139/tests#failed-test-0]
> and
> {code:java}
> junit.framework.AssertionFailedError: nodetool command [removenode, 6d194555-f6eb-41d0-c000-000000000003, --force] was not successful
> stdout:
> stderr: error: Node /127.0.0.4:7012 is alive and owns this ID. Use decommission command to remove it from the ring
> -- StackTrace --
> java.lang.UnsupportedOperationException: Node /127.0.0.4:7012 is alive and owns this ID. Use decommission command to remove it from the ring
>  at org.apache.cassandra.tcm.sequences.SingleNodeSequences.removeNode(SingleNodeSequences.java:110)
>  at org.apache.cassandra.service.StorageService.removeNode(StorageService.java:3682)
>  at org.apache.cassandra.tools.NodeProbe.removeNode(NodeProbe.java:1020)
>  at org.apache.cassandra.tools.nodetool.RemoveNode.execute(RemoveNode.java:51)
>  at org.apache.cassandra.tools.NodeTool$NodeToolCmd.runInternal(NodeTool.java:388)
>  at org.apache.cassandra.tools.NodeTool$NodeToolCmd.run(NodeTool.java:373)
>  at org.apache.cassandra.tools.NodeTool.execute(NodeTool.java:272)
>  at org.apache.cassandra.distributed.impl.Instance$DTestNodeTool.execute(Instance.java:1129)
>  at org.apache.cassandra.distributed.impl.Instance.lambda$nodetoolResult$51(Instance.java:1038)
>  at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)
>  at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)
>  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>  at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>  at java.base/java.lang.Thread.run(Thread.java:833)
> Notifications: Error: java.lang.UnsupportedOperationException: Node /127.0.0.4:7012 is alive and owns this ID. Use decommission command to remove it from the ring
>  at org.apache.cassandra.tcm.sequences.SingleNodeSequences.removeNode(SingleNodeSequences.java:110)
>  at org.apache.cassandra.service.StorageService.removeNode(StorageService.java:3682)
>  at org.apache.cassandra.tools.NodeProbe.removeNode(NodeProbe.java:1020)
>  at org.apache.cassandra.tools.nodetool.RemoveNode.execute(RemoveNode.java:51)
>  at org.apache.cassandra.tools.NodeTool$NodeToolCmd.runInternal(NodeTool.java:388)
>  at org.apache.cassandra.tools.NodeTool$NodeToolCmd.run(NodeTool.java:373)
>  at org.apache.cassandra.tools.NodeTool.execute(NodeTool.java:272)
>  at org.apache.cassandra.distributed.impl.Instance$DTestNodeTool.execute(Instance.java:1129)
>  at org.apache.cassandra.distributed.impl.Instance.lambda$nodetoolResult$51(Instance.java:1038)
>  at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)
>  at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)
>  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>  at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>  at java.base/java.lang.Thread.run(Thread.java:833)
>  at org.apache.cassandra.distributed.api.NodeToolResult$Asserts.fail(NodeToolResult.java:214)
>  at org.apache.cassandra.distributed.api.NodeToolResult$Asserts.success(NodeToolResult.java:97)
>  at org.apache.cassandra.distributed.test.TransientRangeMovementTest.testRemoveNode(TransientRangeMovementTest.java:173)
>  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
>  at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> {code}
> as in here - 
> [https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/2634/workflows/24617d26-e297-4857-bc43-b6a04e64a6ea/jobs/54534/tests#failed-test-0]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

