[ https://issues.apache.org/jira/browse/CASSANDRA-19633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17846595#comment-17846595 ]
Marcus Eriksson edited comment on CASSANDRA-19633 at 5/15/24 12:54 PM:
-----------------------------------------------------------------------

I now have this reproducing locally using tokens/datacenters provided by [~jaid], thanks!

The problem is indeed that we only provide one source for each range post CASSANDRA-14405.

The 4650-optimisation is done by setting up a graph where each source -> destination edge has a capacity of one - optimally we would stream each range from a unique source. If we can't find a flow using a capacity of one, we bump the capacity on all edges to two and try to calculate the flow again, and repeat this bumping until we have found a flow where we can [stream all ranges|https://github.com/apache/cassandra/blob/75794540573b6f0c39094b5448fe73326e14e058/src/java/org/apache/cassandra/dht/RangeFetchMapCalculator.java#L134-L148].

The problem, though, is that with only a single source per range we are very far away from being able to find a flow with edge capacities of 1, so in this cluster the calculation is done 1000+ times, and each calculation takes several minutes (per keyspace). And the result is terrible anyway because we end up streaming from only two sources.

Removing [this|https://github.com/apache/cassandra/blob/6bae4f76fb043b4c3a3886178b5650b280e9a50b/src/java/org/apache/cassandra/dht/RangeStreamer.java#L531] line allows us to do the calculation only once, and we stream from 191 sources. It still takes several minutes to do the calculation, but we most likely save more time due to quicker streaming later. But as the comment says, some downstream uses of {{sources}} require it to be only a single node, so I'll need to fix those places before submitting a patch.


> Replaced node is stuck in a loop calculating ranges
> ---------------------------------------------------
>
>                 Key: CASSANDRA-19633
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19633
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Bootstrap and Decommission
>            Reporter: Jai Bheemsen Rao Dhanwada
>            Assignee: Marcus Eriksson
>            Priority: Normal
>              Labels: Bootstrap
>             Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
>         Attachments: result1.html
>
>
> Hello,
>
> I am running into an issue where a node that is replacing a dead
> (non-seed) node is stuck calculating ranges forever.
> It eventually succeeds, but the time taken to calculate the ranges is not constant.
> I sometimes see it take 24 hours to calculate ranges for each keyspace.
> Attached is the flame graph of the Cassandra process during this time,
> which points to the code below.
> {code:java}
> Multimap<InetAddressAndPort, Range<Token>> getRangeFetchMapForNonTrivialRanges()
> {
>     //Get the graph with edges between ranges and their source endpoints
>     MutableCapacityGraph<Vertex, Integer> graph = getGraph();
>     //Add source and destination vertex and edges
>     addSourceAndDestination(graph, getDestinationLinkCapacity(graph));
>     int flow = 0;
>     MaximumFlowAlgorithmResult<Integer, CapacityEdge<Vertex, Integer>> result = null;
>     //We might not be working on all ranges
>     while (flow < getTotalRangeVertices(graph))
>     {
>         if (flow > 0)
>         {
>             //We could not find a path with previous graph. Bump the capacity b/w endpoint vertices and destination by 1
>             incrementCapacity(graph, 1);
>         }
>         MaximumFlowAlgorithm fordFulkerson = FordFulkersonAlgorithm.getInstance(DFSPathFinder.getInstance());
>         result = fordFulkerson.calc(graph, sourceVertex, destinationVertex, IntegerNumberSystem.getInstance());
>         int newFlow = result.calcTotalFlow();
>         assert newFlow > flow; //We are not making progress which should not happen
>         flow = newFlow;
>     }
>     return getRangeFetchMapFromGraphResult(graph, result);
> }
> {code}
> Digging through the logs, I see the log line below for a given keyspace
> `system_auth`
> {code:java}
> INFO [main] 2024-05-10 17:35:02,489 RangeStreamer.java:330 - Bootstrap: range Full(/10.135.56.214:7000,(5080189126057290696,5081324396311791613]) exists on Full(/10.135.56.157:7000,(5080189126057290696,5081324396311791613]) for keyspace system_auth{code}
> corresponding code:
> {code:java}
> for (Map.Entry<Replica, Replica> entry : fetchMap.flattenEntries())
>     logger.info("{}: range {} exists on {} for keyspace {}", description, entry.getKey(), entry.getValue(), keyspaceName);{code}
> BUT I do not see the line below for the corresponding keyspace
> {code:java}
> RangeStreamer.java:606 - Output from RangeFetchMapCalculator for keyspace{code}
> This means the code is stuck in `getRangeFetchMap();`
> {code:java}
> Multimap<InetAddressAndPort, Range<Token>> rangeFetchMapMap = calculator.getRangeFetchMap();
> logger.info("Output from RangeFetchMapCalculator for keyspace {}", keyspace);{code}
> Here is the cluster topology:
> * Cassandra version: 4.0.12
> * # of nodes: 190
> * Tokens (vnodes): 128
>
> My initial hypothesis was that the graph calculation was taking longer due to
> the combination of nodes + tokens + tables, but in the same cluster I see that
> one of the nodes joined without any issues.
> I am wondering whether I am hitting a bug that lets it work sometimes but
> sends it into an infinite loop at other times.
> Please let me know if you need any other details; I would appreciate any pointers
> to debug this further.
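To make the cost of the capacity-bumping loop in getRangeFetchMapForNonTrivialRanges easier to see, here is a minimal standalone sketch. It is not the Cassandra implementation: the class and method names (CapacityBumpSketch, minUniformCapacity, maxAssigned, tryAssign) and the input shape are hypothetical, and it swaps the Ford-Fulkerson library call for a plain augmenting-path assignment. It only assumes what the comment describes: every endpoint starts with a capacity of one, and the whole assignment is recomputed each time the capacity is bumped.

```java
import java.util.ArrayList;
import java.util.List;

public class CapacityBumpSketch
{
    // candidates.get(r) holds the endpoint indexes able to serve range r (hypothetical input shape).
    // Returns the smallest uniform per-endpoint capacity at which every range gets a source,
    // which is also the number of full recomputations the loop performs.
    // Returns 0 for no ranges and -1 if some range has no candidate endpoint.
    static int minUniformCapacity(List<int[]> candidates, int endpointCount)
    {
        int rangeCount = candidates.size();
        if (rangeCount == 0)
            return 0;
        for (int[] c : candidates)
            if (c.length == 0)
                return -1;

        // Start at capacity one and bump by one until every range is assigned.
        for (int capacity = 1; capacity < rangeCount; capacity++)
            if (maxAssigned(candidates, endpointCount, capacity) == rangeCount)
                return capacity;
        // With capacity == rangeCount any endpoint can take every range.
        return rangeCount;
    }

    // Recomputes the assignment from scratch for the given capacity.
    static int maxAssigned(List<int[]> candidates, int endpointCount, int capacity)
    {
        List<List<Integer>> assigned = new ArrayList<>();
        for (int e = 0; e < endpointCount; e++)
            assigned.add(new ArrayList<>());

        int count = 0;
        for (int r = 0; r < candidates.size(); r++)
            if (tryAssign(r, candidates, assigned, capacity, new boolean[endpointCount]))
                count++;
        return count;
    }

    // Augmenting-path step: place the range on an endpoint with spare capacity,
    // or move a range already held by a full endpoint somewhere else.
    static boolean tryAssign(int range, List<int[]> candidates, List<List<Integer>> assigned,
                             int capacity, boolean[] visited)
    {
        for (int e : candidates.get(range))
        {
            if (visited[e])
                continue;
            visited[e] = true;
            List<Integer> held = assigned.get(e);
            if (held.size() < capacity)
            {
                held.add(range);
                return true;
            }
            for (int i = 0; i < held.size(); i++)
            {
                if (tryAssign(held.get(i), candidates, assigned, capacity, visited))
                {
                    held.set(i, range); // the displaced range found another endpoint
                    return true;
                }
            }
        }
        return false;
    }
}
```

In this toy model, four ranges that each list only endpoint 0 need four rounds (capacity 4), while four ranges that each list all four endpoints finish in a single round. That is the same shape as the behaviour described in the comment: with one source per range the capacity has to be bumped many times, and each bump repeats the full calculation.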