[ https://issues.apache.org/jira/browse/CASSANDRA-19344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17836302#comment-17836302 ]
Sam Tunnicliffe commented on CASSANDRA-19344:
---------------------------------------------
The actual cause was that the way we construct placement deltas for a
{{PlacementTransitionPlan}} did not properly account for transientness.
Multi-step operations always follow this pattern:
* add new write replicas
* add new read replicas/remove old read replicas
* remove old write replicas
So when an operation causes a replica to transition from TRANSIENT to FULL for
the same range (or part of a range), it could become a FULL read replica before
becoming a FULL write replica.
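To make the hazard concrete before the worked example, here's a minimal, self-contained Java sketch. The {{Mode}} and {{Step}} types and the epoch labels are purely hypothetical stand-ins, not Cassandra's placement API; it just encodes the pre-fix ordering in which the read placement reaches FULL one step before the write placement:
{code:java}
// Hypothetical sketch only: Mode, Step and the step labels are stand-ins,
// not Cassandra's real placement types.
import java.util.List;

public class TransitionOrderSketch
{
    enum Mode { TRANSIENT, FULL }

    // N2's mode for one range in the write and read placements after each step.
    record Step(String enacted, Mode write, Mode read) {}

    public static void main(String[] args)
    {
        // Pre-fix ordering: reads are promoted at MID_LEAVE, writes only at FINISH_LEAVE.
        List<Step> preFix = List.of(new Step("START_LEAVE(N4)",  Mode.TRANSIENT, Mode.TRANSIENT),
                                    new Step("MID_LEAVE(N4)",    Mode.TRANSIENT, Mode.FULL),
                                    new Step("FINISH_LEAVE(N4)", Mode.FULL,      Mode.FULL));

        for (Step s : preFix)
        {
            // The unsafe window: FULL for reads while still only TRANSIENT for writes.
            boolean unsafe = s.read() == Mode.FULL && s.write() == Mode.TRANSIENT;
            System.out.printf("%-16s writes=%-9s reads=%-9s%s%n",
                              s.enacted(), s.write(), s.read(),
                              unsafe ? " <- FULL reads before FULL writes" : "");
        }
    }
}
{code}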
Consider this simplified example, where we remove N4, and its effect on N2:
{code}
RF=3/1

At START
         10        20        30        40
+---------+---------+---------+---------+---------+
     N1        N2        N3        N4

N2 replicates:
  (10,20]       - FULL (Primary Range)
  (,10] + (40,] - FULL
  (30,40]       - TRANSIENT

After FINISH
         10        20        30
+---------+---------+---------+-------------------+
     N1        N2        N3

N2 replicates:
  (10,20]       - FULL (Primary Range)
  (,10] + (30,] - FULL
  (20,30]       - TRANSIENT

In removing N4, N2 gains (20,30] TRANSIENT and (30,40] TRANSIENT -> FULL

Potential problem ->
  for READS  N2 becomes FULL(30,40] after MID_LEAVE
  for WRITES N2 only becomes FULL(30,40] after FINISH_LEAVE

Between the two events, coordinators will not send writes to N2 unless one of
the other replicas is unresponsive, but they will send reads to N2 during this
window. If cleanup is run before N2 becomes a FULL replica for (30,40], any
data for that range (including that which was just streamed to it) will be
purged.
{code}
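To spell out the cleanup hazard, here is an illustrative Java sketch. The sets and the purge rule are toy stand-ins (real cleanup derives owned ranges from cluster metadata; {{fullWriteRanges}} below is an assumption for illustration): if, during the window, N2 keeps only the ranges it fully write-owns, the freshly streamed {{(30,40]}} data disappears while reads still target it.
{code:java}
// Illustrative only: a toy model of cleanup during the unsafe window.
// Real cleanup consults cluster metadata; these collections are stand-ins.
import java.util.*;

public class CleanupWindowSketch
{
    public static void main(String[] args)
    {
        // Pre-fix, between MID_LEAVE and FINISH_LEAVE: N2's FULL *write* ranges
        // do not yet include (30,40], even though reads already treat it as FULL.
        Set<String> fullWriteRanges = Set.of("(40,]", "(,10]", "(10,20]");

        // Data N2 holds, keyed by range; (30,40] includes rows just streamed to it.
        Map<String, List<Integer>> data = new HashMap<>();
        data.put("(10,20]", new ArrayList<>(List.of(11, 15)));
        data.put("(30,40]", new ArrayList<>(List.of(33, 38)));

        // A cleanup that keeps only fully write-owned ranges purges (30,40]...
        data.keySet().removeIf(range -> !fullWriteRanges.contains(range));

        // ...while coordinators are still sending N2 FULL reads for (30,40].
        System.out.println("Surviving after cleanup: " + data); // {(10,20]=[11, 15]}
    }
}
{code}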
Below is an illustration of the ranges replicated by N2 at each step:
{code}
+-----+------------------------+--------------------------------------------------------------------------------+
|EPOCH| STATE                  | RANGES REPLICATED BY N2                                                        |
|-----+------------------------+--------------------------------------------------------------------------------|
|0    | START STATE            | WRITES -> FULL: [(40,], (,10], (10,20]]          TRANSIENT: [(30,40]]          |
|     |                        | READS  -> FULL: [(40,], (,10], (10,20]]          TRANSIENT: [(30,40]]          |
|-----+------------------------+--------------------------------------------------------------------------------|
|1    | ENACT START_LEAVE(N4)  | WRITES -> FULL: [(40,], (,10], (10,20]]          TRANSIENT: [(20,30], (30,40]] |
|     |                        | READS  -> FULL: [(40,], (,10], (10,20]]          TRANSIENT: [(30,40]]          |
|-----+------------------------+--------------------------------------------------------------------------------|
|2    | ENACT MID_LEAVE(N4)    | WRITES -> FULL: [(40,], (,10], (10,20]]          TRANSIENT: [(20,30], (30,40]] |
|     |                        | READS  -> FULL: [(40,], (,10], (10,20], (30,40]] TRANSIENT: [(20,30]]          |
|-----+------------------------+--------------------------------------------------------------------------------|
|3    | ENACT FINISH_LEAVE(N4) | WRITES -> FULL: [(30,], (,10], (10,20]]          TRANSIENT: [(20,30]]          |
|     |                        | READS  -> FULL: [(30,], (,10], (10,20]]          TRANSIENT: [(20,30]]          |
+-----+------------------------+--------------------------------------------------------------------------------+
{code}
After applying the fix here, the deltas change so that the promotion of
{{(30,40]}} from {{TRANSIENT}} to {{FULL}} for writes is part of enacting
{{START_LEAVE(N4)}} in epoch 1, i.e. before N2 becomes a FULL replica for reads
of {{(30,40]}} when {{MID_LEAVE(N4)}} is enacted in epoch 2.
{code}
+-----+------------------------+--------------------------------------------------------------------------------+
|EPOCH| STATE                  | RANGES REPLICATED BY N2                                                        |
|-----+------------------------+--------------------------------------------------------------------------------|
|0    | START STATE            | WRITES -> FULL: [(40,], (,10], (10,20]]          TRANSIENT: [(30,40]]          |
|     |                        | READS  -> FULL: [(40,], (,10], (10,20]]          TRANSIENT: [(30,40]]          |
|-----+------------------------+--------------------------------------------------------------------------------|
|1    | ENACT START_LEAVE(N4)  | WRITES -> FULL: [(40,], (,10], (10,20], (30,40]] TRANSIENT: [(20,30]]          |
|     |                        | READS  -> FULL: [(40,], (,10], (10,20]]          TRANSIENT: [(30,40]]          |
|-----+------------------------+--------------------------------------------------------------------------------|
|2    | ENACT MID_LEAVE(N4)    | WRITES -> FULL: [(40,], (,10], (10,20], (30,40]] TRANSIENT: [(20,30]]          |
|     |                        | READS  -> FULL: [(40,], (,10], (10,20], (30,40]] TRANSIENT: [(20,30]]          |
|-----+------------------------+--------------------------------------------------------------------------------|
|3    | ENACT FINISH_LEAVE(N4) | WRITES -> FULL: [(30,], (,10], (10,20]]          TRANSIENT: [(20,30]]          |
|     |                        | READS  -> FULL: [(30,], (,10], (10,20]]          TRANSIENT: [(20,30]]          |
+-----+------------------------+--------------------------------------------------------------------------------+
{code}
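Stated as an invariant, the fix ensures that for a given node and range, the FULL write placement is enacted no later than the FULL read placement. The following sketch is hypothetical scaffolding; only the per-epoch modes are taken from the two tables above, for N2 over {{(30,40]}}:
{code:java}
// Hypothetical scaffolding: checks the safety property the fix restores, using
// N2's per-epoch modes for (30,40] transcribed from the two tables above.
public class PlacementInvariantSketch
{
    // First epoch (index) at which the mode becomes FULL, or MAX_VALUE if never.
    static int firstFullEpoch(String[] modeByEpoch)
    {
        for (int epoch = 0; epoch < modeByEpoch.length; epoch++)
            if (modeByEpoch[epoch].equals("FULL"))
                return epoch;
        return Integer.MAX_VALUE;
    }

    // Safe iff the node is a FULL write replica no later than it is a FULL read replica.
    static void check(String label, String[] writes, String[] reads)
    {
        boolean safe = firstFullEpoch(writes) <= firstFullEpoch(reads);
        System.out.println(label + " -> " + (safe ? "safe" : "UNSAFE"));
    }

    public static void main(String[] args)
    {
        // Epochs 0..3 = START, START_LEAVE, MID_LEAVE, FINISH_LEAVE
        check("pre-19344 ",
              new String[] {"TRANSIENT", "TRANSIENT", "TRANSIENT", "FULL"},  // writes
              new String[] {"TRANSIENT", "TRANSIENT", "FULL", "FULL"});      // reads
        check("post-19344",
              new String[] {"TRANSIENT", "FULL", "FULL", "FULL"},            // writes
              new String[] {"TRANSIENT", "TRANSIENT", "FULL", "FULL"});      // reads
    }
}
{code}
Pre-fix the check reports UNSAFE because reads reach FULL at epoch 2 while writes only do so at epoch 3; post-fix, writes lead at epoch 1, so the property holds.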
For more detail, I've also attached files showing the specific placement
deltas applied at each of these steps.
> Range movements involving transient replicas must safely enact changes to
> read and write replica sets
> -----------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-19344
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19344
> Project: Cassandra
> Issue Type: Bug
> Components: CI
> Reporter: Ekaterina Dimitrova
> Assignee: Sam Tunnicliffe
> Priority: Normal
> Fix For: 5.x
>
> Attachments: ci_summary.html, remove-n4-post-19344.txt,
> remove-n4-pre-19344.txt, result_details.tar.gz
>
> Time Spent: 1h 40m
> Remaining Estimate: 0h
>
> (edit) This was originally opened due to a flaky test
> {{org.apache.cassandra.distributed.test.TransientRangeMovementTest.testRemoveNode-_jdk17}}
> The test can fail in two different ways:
> {code:java}
> junit.framework.AssertionFailedError: NOT IN CURRENT: 31 -- [(00,20), (31,50)]
>     at org.apache.cassandra.distributed.test.TransientRangeMovementTest.assertAllContained(TransientRangeMovementTest.java:203)
>     at org.apache.cassandra.distributed.test.TransientRangeMovementTest.testRemoveNode(TransientRangeMovementTest.java:183)
>     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
>     at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43){code}
> as seen here -
> [https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/2639/workflows/32b92ce7-5e9d-4efb-8362-d200d2414597/jobs/55139/tests#failed-test-0]
> and
> {code:java}
> junit.framework.AssertionFailedError: nodetool command [removenode, 6d194555-f6eb-41d0-c000-000000000003, --force] was not successful
> stdout:
> stderr: error: Node /127.0.0.4:7012 is alive and owns this ID. Use decommission command to remove it from the ring
> -- StackTrace --
> java.lang.UnsupportedOperationException: Node /127.0.0.4:7012 is alive and owns this ID. Use decommission command to remove it from the ring
>     at org.apache.cassandra.tcm.sequences.SingleNodeSequences.removeNode(SingleNodeSequences.java:110)
>     at org.apache.cassandra.service.StorageService.removeNode(StorageService.java:3682)
>     at org.apache.cassandra.tools.NodeProbe.removeNode(NodeProbe.java:1020)
>     at org.apache.cassandra.tools.nodetool.RemoveNode.execute(RemoveNode.java:51)
>     at org.apache.cassandra.tools.NodeTool$NodeToolCmd.runInternal(NodeTool.java:388)
>     at org.apache.cassandra.tools.NodeTool$NodeToolCmd.run(NodeTool.java:373)
>     at org.apache.cassandra.tools.NodeTool.execute(NodeTool.java:272)
>     at org.apache.cassandra.distributed.impl.Instance$DTestNodeTool.execute(Instance.java:1129)
>     at org.apache.cassandra.distributed.impl.Instance.lambda$nodetoolResult$51(Instance.java:1038)
>     at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)
>     at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>     at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>     at java.base/java.lang.Thread.run(Thread.java:833)
> Notifications:
> Error: java.lang.UnsupportedOperationException: Node /127.0.0.4:7012 is alive and owns this ID. Use decommission command to remove it from the ring
>     at org.apache.cassandra.tcm.sequences.SingleNodeSequences.removeNode(SingleNodeSequences.java:110)
>     at org.apache.cassandra.service.StorageService.removeNode(StorageService.java:3682)
>     at org.apache.cassandra.tools.NodeProbe.removeNode(NodeProbe.java:1020)
>     at org.apache.cassandra.tools.nodetool.RemoveNode.execute(RemoveNode.java:51)
>     at org.apache.cassandra.tools.NodeTool$NodeToolCmd.runInternal(NodeTool.java:388)
>     at org.apache.cassandra.tools.NodeTool$NodeToolCmd.run(NodeTool.java:373)
>     at org.apache.cassandra.tools.NodeTool.execute(NodeTool.java:272)
>     at org.apache.cassandra.distributed.impl.Instance$DTestNodeTool.execute(Instance.java:1129)
>     at org.apache.cassandra.distributed.impl.Instance.lambda$nodetoolResult$51(Instance.java:1038)
>     at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)
>     at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>     at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>     at java.base/java.lang.Thread.run(Thread.java:833)
>     at org.apache.cassandra.distributed.api.NodeToolResult$Asserts.fail(NodeToolResult.java:214)
>     at org.apache.cassandra.distributed.api.NodeToolResult$Asserts.success(NodeToolResult.java:97)
>     at org.apache.cassandra.distributed.test.TransientRangeMovementTest.testRemoveNode(TransientRangeMovementTest.java:173)
>     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
>     at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43){code}
> as seen here -
> [https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/2634/workflows/24617d26-e297-4857-bc43-b6a04e64a6ea/jobs/54534/tests#failed-test-0]