Jai Bheemsen Rao Dhanwada created CASSANDRA-19633:
-----------------------------------------------------
Summary: Replaced node is stuck in a loop calculating ranges
Key: CASSANDRA-19633
URL: https://issues.apache.org/jira/browse/CASSANDRA-19633
Project: Cassandra
Issue Type: Bug
Reporter: Jai Bheemsen Rao Dhanwada
Attachments: result1.html
Hello,
I am running into an issue in which a node that is replacing a dead (non-seed)
node gets stuck calculating ranges for a very long time. It eventually
succeeds, but the time taken varies widely: I sometimes see it take 24 hours
to calculate the ranges for each keyspace. I have attached a flame graph of
the Cassandra process captured during this time, which points to the code
below.
```
Multimap<InetAddressAndPort, Range<Token>> getRangeFetchMapForNonTrivialRanges()
{
    //Get the graph with edges between ranges and their source endpoints
    MutableCapacityGraph<Vertex, Integer> graph = getGraph();
    //Add source and destination vertex and edges
    addSourceAndDestination(graph, getDestinationLinkCapacity(graph));
    int flow = 0;
    MaximumFlowAlgorithmResult<Integer, CapacityEdge<Vertex, Integer>> result = null;
    //We might not be working on all ranges
    while (flow < getTotalRangeVertices(graph))
    {
        if (flow > 0)
        {
            //We could not find a path with previous graph. Bump the capacity b/w endpoint vertices and destination by 1
            incrementCapacity(graph, 1);
        }
        MaximumFlowAlgorithm fordFulkerson = FordFulkersonAlgorithm.getInstance(DFSPathFinder.getInstance());
        result = fordFulkerson.calc(graph, sourceVertex, destinationVertex, IntegerNumberSystem.getInstance());
        int newFlow = result.calcTotalFlow();
        assert newFlow > flow; //We are not making progress which should not happen
        flow = newFlow;
    }
    return getRangeFetchMapFromGraphResult(graph, result);
}
```
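To make the shape of that loop easier to reason about, here is a minimal, self-contained sketch of the same idea on a toy graph. It is not the Cassandra implementation: the class `RangeFetchLoopSketch`, the method `passesNeeded`, the dense capacity matrix, and the starting endpoint capacity of 1 are all assumptions made for illustration (the real code takes its starting capacity from `getDestinationLinkCapacity(graph)` and uses a graph library). In this sketch every capacity bump reruns max flow from scratch, so the number of passes grows with how unevenly the ranges are spread across source endpoints.

```java
public class RangeFetchLoopSketch
{
    /** Depth-first search for one augmenting path; pushes one unit of flow along it. */
    static boolean augment(int[][] residual, int u, int sink, boolean[] seen)
    {
        if (u == sink)
            return true;
        seen[u] = true;
        for (int v = 0; v < residual.length; v++)
        {
            if (!seen[v] && residual[u][v] > 0 && augment(residual, v, sink, seen))
            {
                residual[u][v]--; // consume forward capacity
                residual[v][u]++; // allow the flow to be undone later
                return true;
            }
        }
        return false;
    }

    /** Ford-Fulkerson with DFS path finding, computed from scratch on a copy of the capacities. */
    static int maxFlow(int[][] capacity, int source, int sink)
    {
        int n = capacity.length;
        int[][] residual = new int[n][];
        for (int i = 0; i < n; i++)
            residual[i] = capacity[i].clone();
        int flow = 0;
        while (augment(residual, source, sink, new boolean[n]))
            flow++;
        return flow;
    }

    /**
     * Hypothetical helper: holders[r] lists the endpoints that can serve range r.
     * Returns how many max-flow passes are needed before every range is assigned.
     * Assumes every range has at least one holder.
     */
    static int passesNeeded(int[][] holders, int endpoints)
    {
        int ranges = holders.length;
        int n = ranges + endpoints + 2;
        int source = 0, sink = n - 1;
        int[][] capacity = new int[n][n];
        for (int r = 0; r < ranges; r++)
        {
            capacity[source][1 + r] = 1;                 // each range is fetched once
            for (int e : holders[r])
                capacity[1 + r][1 + ranges + e] = 1;     // range -> endpoint that holds it
        }
        for (int e = 0; e < endpoints; e++)
            capacity[1 + ranges + e][sink] = 1;          // assumed starting capacity

        int flow = 0, passes = 0;
        while (flow < ranges)
        {
            if (flow > 0)
                for (int e = 0; e < endpoints; e++)
                    capacity[1 + ranges + e][sink]++;    // bump endpoint -> destination by 1
            int newFlow = maxFlow(capacity, source, sink);
            passes++;
            if (newFlow <= flow)                         // stands in for the assert above
                throw new IllegalStateException("not making progress");
            flow = newFlow;
        }
        return passes;
    }

    public static void main(String[] args)
    {
        // Three ranges that can only be served by endpoint 0, in a two-endpoint toy cluster.
        System.out.println(passesNeeded(new int[][]{ { 0 }, { 0 }, { 0 } }, 2));
    }
}
```

In this toy model, a skewed layout where many ranges share a single source needs one pass per extra unit of capacity, while a balanced layout finishes in a single pass. Whether that is what happens on my cluster is only a guess.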
Digging through the logs, I see the following line for the keyspace
`system_auth`:
```
INFO [main] 2024-05-10 17:35:02,489 RangeStreamer.java:330 - Bootstrap: range
Full(/10.135.56.214:7000,(5080189126057290696,5081324396311791613]) exists on
Full(/10.135.56.157:7000,(5080189126057290696,5081324396311791613]) for
keyspace system_auth
```
The corresponding code:
```
for (Map.Entry<Replica, Replica> entry : fetchMap.flattenEntries())
    logger.info("{}: range {} exists on {} for keyspace {}", description,
                entry.getKey(), entry.getValue(), keyspaceName);
```
But I do not see the line below for the same keyspace:
```
RangeStreamer.java:606 - Output from RangeFetchMapCalculator for keyspace
```
This means the code is stuck in `getRangeFetchMap()`:
```
Multimap<InetAddressAndPort, Range<Token>> rangeFetchMapMap = calculator.getRangeFetchMap();
logger.info("Output from RangeFetchMapCalculator for keyspace {}", keyspace);
```
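The log check described above can be automated. The following is a rough sketch that scans log lines and reports keyspaces for which a "range ... exists on ..." line was seen but no "Output from RangeFetchMapCalculator" line has followed yet. The class `StuckKeyspaceFinder` and the method `pendingKeyspaces` are hypothetical names, and the patterns assume each log message sits on a single line in the format quoted above.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StuckKeyspaceFinder
{
    // Matches the "{}: range {} exists on {} for keyspace {}" message quoted above.
    private static final Pattern RANGE_LINE =
        Pattern.compile("range .* exists on .* for keyspace (\\S+)");
    // Matches the message logged after getRangeFetchMap() returns.
    private static final Pattern OUTPUT_LINE =
        Pattern.compile("Output from RangeFetchMapCalculator for keyspace (\\S+)");

    /** Returns keyspaces whose fetch-map calculation started but has not logged completion. */
    static Set<String> pendingKeyspaces(List<String> logLines)
    {
        Set<String> pending = new LinkedHashSet<>();
        for (String line : logLines)
        {
            Matcher output = OUTPUT_LINE.matcher(line);
            if (output.find())
            {
                pending.remove(output.group(1)); // calculation finished for this keyspace
                continue;
            }
            Matcher range = RANGE_LINE.matcher(line);
            if (range.find())
                pending.add(range.group(1));     // calculation is about to start
        }
        return pending;
    }
}
```

Feeding it the lines of the node's log would list `system_auth` for as long as the node is still inside the calculator for that keyspace.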
Here is the cluster topology:
* Cassandra version: 4.0.12
* Number of nodes: 190
* Tokens (vnodes): 128
My initial hypothesis was that the graph calculation takes longer because of
the combination of nodes, tokens, and tables, but in the same cluster one of
the nodes joined without any issues.
I am wondering whether I am hitting a bug that causes it to work sometimes but
get into an infinite loop at other times.
Please let me know if you need any other details. I would appreciate any
pointers for debugging this further.