[
https://issues.apache.org/jira/browse/CASSANDRA-19633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Brandon Williams updated CASSANDRA-19633:
-----------------------------------------
Reviewers: Brandon Williams
> Replaced node is stuck in a loop calculating ranges
> ---------------------------------------------------
>
> Key: CASSANDRA-19633
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19633
> Project: Apache Cassandra
> Issue Type: Bug
> Components: Consistency/Bootstrap and Decommission
> Reporter: Jai Bheemsen Rao Dhanwada
> Assignee: Marcus Eriksson
> Priority: Normal
> Labels: Bootstrap
> Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
> Attachments: result1.html
>
> Time Spent: 50m
> Remaining Estimate: 0h
>
> Hello,
>
> I am running into an issue where a node that is replacing a dead
> (non-seed) node appears stuck calculating ranges for a very long time. It
> eventually succeeds, but the time taken to calculate the ranges varies
> widely. I have sometimes seen it take 24 hours to calculate ranges for each
> keyspace. I have attached a flame graph of the Cassandra process during this
> time, which points to the code below.
> {code:java}
> Multimap<InetAddressAndPort, Range<Token>> getRangeFetchMapForNonTrivialRanges()
> {
>     //Get the graph with edges between ranges and their source endpoints
>     MutableCapacityGraph<Vertex, Integer> graph = getGraph();
>     //Add source and destination vertex and edges
>     addSourceAndDestination(graph, getDestinationLinkCapacity(graph));
>     int flow = 0;
>     MaximumFlowAlgorithmResult<Integer, CapacityEdge<Vertex, Integer>> result = null;
>     //We might not be working on all ranges
>     while (flow < getTotalRangeVertices(graph))
>     {
>         if (flow > 0)
>         {
>             //We could not find a path with previous graph. Bump the capacity b/w
>             //endpoint vertices and destination by 1
>             incrementCapacity(graph, 1);
>         }
>         MaximumFlowAlgorithm fordFulkerson = FordFulkersonAlgorithm.getInstance(DFSPathFinder.getInstance());
>         result = fordFulkerson.calc(graph, sourceVertex, destinationVertex, IntegerNumberSystem.getInstance());
>         int newFlow = result.calcTotalFlow();
>         assert newFlow > flow; //We are not making progress which should not happen
>         flow = newFlow;
>     }
>     return getRangeFetchMapFromGraphResult(graph, result);
> }
> {code}
> Digging through the logs, I see the following line for the keyspace
> `system_auth`:
> {code:java}
> INFO [main] 2024-05-10 17:35:02,489 RangeStreamer.java:330 - Bootstrap: range
> Full(/10.135.56.214:7000,(5080189126057290696,5081324396311791613]) exists on
> Full(/10.135.56.157:7000,(5080189126057290696,5081324396311791613]) for
> keyspace system_auth{code}
> The corresponding code:
> {code:java}
> for (Map.Entry<Replica, Replica> entry : fetchMap.flattenEntries())
>     logger.info("{}: range {} exists on {} for keyspace {}", description,
>                 entry.getKey(), entry.getValue(), keyspaceName);
> {code}
> However, I do not see the line below for the same keyspace:
> {code:java}
> RangeStreamer.java:606 - Output from RangeFetchMapCalculator for
> keyspace{code}
> This suggests the code is stuck in `getRangeFetchMap();`
> {code:java}
> Multimap<InetAddressAndPort, Range<Token>> rangeFetchMapMap = calculator.getRangeFetchMap();
> logger.info("Output from RangeFetchMapCalculator for keyspace {}", keyspace);
> {code}
> Here is the cluster topology:
> * Cassandra version: 4.0.12
> * # of nodes: 190
> * Tokens (vnodes): 128
> My initial hypothesis was that the graph calculation takes long because of
> the combination of nodes, tokens, and tables, but in the same cluster I have
> seen one of the nodes join without any issues.
> I am wondering whether I am hitting a bug that lets it work sometimes but
> sends it into an infinite loop at other times.
> Please let me know if you need any other details. I would appreciate any
> pointers to debug this further.
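
For readers unfamiliar with the loop quoted above, here is a minimal, self-contained sketch of the same capacity-bumping idea: run a maximum-flow computation, and if not every range is covered yet, raise the endpoint-to-destination capacity by 1 and run it again. Everything in it is an assumption made for illustration: the class `CapacityBumpSketch`, the `solve` and `maxFlow` helpers, the `holds` matrix, and the starting capacity are hypothetical, and it uses a plain Edmonds-Karp search on a dense matrix rather than the graph library called by `getRangeFetchMapForNonTrivialRanges()`. It also rebuilds the graph each round instead of mutating it. It is not Cassandra code and does not reproduce or diagnose the reported slowness.

```java
import java.util.ArrayDeque;
import java.util.Arrays;

public class CapacityBumpSketch
{
    /** Edmonds-Karp max flow on a dense capacity matrix; cap is mutated into the residual graph. */
    static int maxFlow(int[][] cap, int s, int t)
    {
        int n = cap.length, flow = 0;
        while (true)
        {
            // Breadth-first search for an augmenting path from s to t.
            int[] parent = new int[n];
            Arrays.fill(parent, -1);
            parent[s] = s;
            ArrayDeque<Integer> queue = new ArrayDeque<>();
            queue.add(s);
            while (!queue.isEmpty() && parent[t] == -1)
            {
                int u = queue.poll();
                for (int v = 0; v < n; v++)
                    if (parent[v] == -1 && cap[u][v] > 0)
                    {
                        parent[v] = u;
                        queue.add(v);
                    }
            }
            if (parent[t] == -1)
                return flow; // no augmenting path left
            int bottleneck = Integer.MAX_VALUE;
            for (int v = t; v != s; v = parent[v])
                bottleneck = Math.min(bottleneck, cap[parent[v]][v]);
            for (int v = t; v != s; v = parent[v])
            {
                cap[parent[v]][v] -= bottleneck;
                cap[v][parent[v]] += bottleneck;
            }
            flow += bottleneck;
        }
    }

    /**
     * holds[r][e] is true when hypothetical endpoint e can serve range r.
     * Returns { final flow, number of max-flow rounds that were needed }.
     */
    static int[] solve(boolean[][] holds, int initialCapacity)
    {
        int ranges = holds.length, endpoints = holds[0].length;
        int source = 0, sink = ranges + endpoints + 1;
        int capacity = initialCapacity, flow = 0, rounds = 0;
        while (flow < ranges)
        {
            if (flow > 0)
                capacity++; // previous round did not cover every range: bump by 1
            // Vertices: source, one per range, one per endpoint, sink.
            int[][] cap = new int[sink + 1][sink + 1];
            for (int r = 0; r < ranges; r++)
            {
                cap[source][1 + r] = 1;
                for (int e = 0; e < endpoints; e++)
                    if (holds[r][e])
                        cap[1 + r][1 + ranges + e] = 1;
            }
            for (int e = 0; e < endpoints; e++)
                cap[1 + ranges + e][sink] = capacity;
            int newFlow = maxFlow(cap, source, sink);
            rounds++;
            if (newFlow <= flow) // plays the role of the assert in the quoted loop
                throw new IllegalStateException("no progress at capacity " + capacity);
            flow = newFlow;
        }
        return new int[] { flow, rounds };
    }

    public static void main(String[] args)
    {
        // Three ranges, two endpoints; endpoint 0 holds all three, endpoint 1 only the last.
        boolean[][] holds = { { true, false }, { true, false }, { true, true } };
        int[] result = solve(holds, 1);
        System.out.println("flow=" + result[0] + " rounds=" + result[1]); // prints "flow=3 rounds=2"
    }
}
```

In this toy input, a capacity of 1 covers only two of the three ranges, so a second full max-flow round at capacity 2 is needed. The point of the sketch is only that each bump triggers another complete max-flow computation, so the number of rounds, and the cost of each one, depends on how the ranges are spread over the source endpoints.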