[
https://issues.apache.org/jira/browse/CASSANDRA-14801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Aleksandr Sorokoumov updated CASSANDRA-14801:
---------------------------------------------
Test and Documentation Plan:
I reproduced the bug using a simple [randomized
test|https://gist.github.com/Gerrrr/f59dc5dedaf6501bb31d79b068244213] that
creates a cluster of N nodes and randomly adds, moves, and removes nodes. I
also found another bug, where {{TokenMetadata#calculatePendingRanges}} fails
while calculating move-affected replicas.
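For context, the randomized test is roughly of the following shape. This is a
minimal, self-contained sketch, not the actual test code: the {{Ring}} model
({{TreeMap}} of token to endpoint), the {{RandomizedRingTest}} class name, and
the invariant checked here are illustrative assumptions; the real test drives
actual node lifecycle operations.

```java
import java.util.*;

// Hypothetical, simplified model of the randomized membership test.
// The ring is modelled as token -> endpoint; the loop randomly adds,
// moves, and removes nodes, checking a simple invariant each step.
class RandomizedRingTest {
    static void run(long seed, int iterations) {
        Random rnd = new Random(seed);
        NavigableMap<Long, String> ring = new TreeMap<>();
        int nextNode = 0;
        // Start with a few nodes at random tokens.
        for (int i = 0; i < 3; i++)
            ring.put(rnd.nextLong(), "node" + nextNode++);
        for (int i = 0; i < iterations; i++) {
            switch (rnd.nextInt(3)) {
                case 0: // add: new node at a random token
                    ring.put(rnd.nextLong(), "node" + nextNode++);
                    break;
                case 1: // move: re-insert an existing node at a new token
                    Long token = pick(ring, rnd);
                    String node = ring.remove(token);
                    ring.put(rnd.nextLong(), node);
                    break;
                default: // remove, keeping at least one node in the ring
                    if (ring.size() > 1)
                        ring.remove(pick(ring, rnd));
            }
            // Invariant: every endpoint appears at most once in the ring.
            Set<String> seen = new HashSet<>(ring.values());
            if (seen.size() != ring.size())
                throw new AssertionError("duplicate endpoint at iteration " + i);
        }
    }

    // Pick a random token currently present in the ring.
    static Long pick(NavigableMap<Long, String> ring, Random rnd) {
        int idx = rnd.nextInt(ring.size());
        return new ArrayList<>(ring.keySet()).get(idx);
    }
}
```

The real test performs the analogous operations against live nodes and checks
that pending-range calculation does not throw.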
The patch ([link to
PR|https://github.com/apache/cassandra/pull/495]) addresses both problems:
*
[a534b2|https://github.com/Gerrrr/cassandra/commit/a534b2be9a653fb0cdda75043c0b79e481ca1701]
adds tests that reproduce both bugs.
*
[bb36cd|https://github.com/Gerrrr/cassandra/commit/bb36cd09c164840ad9ab3231f1039c443f8f040c]
fixes the newly discovered bug ({{test1Leave1Move}} in
[a534b2|https://github.com/Gerrrr/cassandra/commit/a534b2be9a653fb0cdda75043c0b79e481ca1701]).
The failure happens because, for the same pending range, we can include two
replicas with the same endpoint but different ranges, violating the
{{Conflict.DUPLICATE}} policy of {{PendingRangeMaps}}. The cause is that we
currently add to {{PendingRangeMaps}} the entire new replica for leave-affected
ranges, but only its pending part for move-affected ranges. This commit marks
only the missing part of the new replica as pending.
*
[a33b03|https://github.com/Gerrrr/cassandra/commit/a33b032cd75044db3475505cd32e6afd6f98ad40]
solves the original issue. Without this commit, {{getPendingRanges}} can fail
if the same replica covers more than one pending range. I think this is a
valid situation. Consider {{testLeave2}} in
[a534b2|https://github.com/Gerrrr/cassandra/commit/a534b2be9a653fb0cdda75043c0b79e481ca1701].
In this case the replica {{Full(127.0.0.1:7012,(-9,0])}} covers two ranges:
{{(-9,-4]}} and {{(-4,0]}}. If we run the same test against 3.11, {{(-9,0]}} is
represented as {{\{(-9,-4], (-4,0]\}}}, and that representation would not
trigger the bug in 4.0. However, I do not think the safety of
{{getPendingRanges}} should depend on how a particular
{{AbstractReplicationStrategy}} represents a range; instead, we should allow
duplicate entries while building the pending {{RangesAtEndpoint}}.
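The two fixes can be illustrated with a minimal sketch using plain collections
in place of Cassandra's {{PendingRangeMaps}} and {{RangesAtEndpoint}}. The
{{Range}} record, the {{missingPart}} helper, and the endpoint-keyed map below
are illustrative assumptions of mine, not the actual patch; real Cassandra
ranges are {{Range<Token>}} and may wrap around the ring.

```java
import java.util.*;

// Illustrative stand-in for Range<Token>: (left, right] over longs,
// assumed non-wrapping for the purposes of this sketch.
record Range(long left, long right) {
    boolean contains(Range other) {
        return left <= other.left && other.right <= right;
    }
}

class PendingSketch {
    // bb36cd idea: mark as pending only the part of the widened range
    // that the endpoint does not already own. With non-wrapping ranges
    // widened on the left, the missing part is (widened.left, owned.left].
    static Range missingPart(Range owned, Range widened) {
        if (!widened.contains(owned))
            throw new IllegalArgumentException("widened must contain owned");
        return new Range(widened.left(), owned.left());
    }

    // a33b03 idea: allow the same endpoint to appear with several pending
    // ranges by collecting ranges per endpoint, rather than rejecting the
    // second entry as a duplicate.
    static Map<String, Set<Range>> collect(List<Map.Entry<String, Range>> pending) {
        Map<String, Set<Range>> byEndpoint = new HashMap<>();
        for (Map.Entry<String, Range> e : pending)
            byEndpoint.computeIfAbsent(e.getKey(), k -> new HashSet<>())
                      .add(e.getValue());
        return byEndpoint;
    }
}
```

For the {{testLeave2}} situation above, collecting {{("127.0.0.1", (-9,-4])}}
and {{("127.0.0.1", (-4,0])}} yields a single endpoint entry holding both
ranges instead of a duplicate-endpoint failure.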
With these changes, the randomized test has not failed after running for a
while. As mentioned, it is a very simplistic test, so any suggestions for
improvement are welcome. I have not included it in the PR, as it seems likely
to become flaky. Perhaps it is worth adding as a {{long}} test?
Status: Patch Available (was: In Progress)
> calculatePendingRanges no longer safe for multiple adjacent range movements
> ---------------------------------------------------------------------------
>
> Key: CASSANDRA-14801
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14801
> Project: Cassandra
> Issue Type: Bug
> Components: Legacy/Coordination, Legacy/Distributed Metadata
> Reporter: Benedict Elliott Smith
> Assignee: Aleksandr Sorokoumov
> Priority: Normal
> Labels: pull-request-available
> Fix For: 4.0, 4.0-beta
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Correctness depended upon the narrowing to a {{Set<InetAddressAndPort>}},
> which we no longer do - we maintain a collection of all {{Replica}}. Our
> {{RangesAtEndpoint}} collection built by {{getPendingRanges}} can as a result
> contain the same endpoint multiple times; and our {{EndpointsForToken}}
> obtained by {{TokenMetadata.pendingEndpointsFor}} may fail to be constructed,
> resulting in cluster-wide failures for writes to the affected token ranges
> for the duration of the range movement.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)