Re: [PR] CASSANDRA-20245: Fix problems and race conditions with topology fetching [cassandra]

via GitHub Wed, 29 Jan 2025 08:35:00 -0800


dcapwell commented on code in PR #3842:
URL: https://github.com/apache/cassandra/pull/3842#discussion_r1934218316



##########
src/java/org/apache/cassandra/service/accord/FetchTopology.java:
##########
@@ -123,18 +116,20 @@ public Response(long epoch, Topology topology)
         long epoch = message.payload.epoch;
         Topology topology = AccordService.instance().topology().maybeGlobalForEpoch(epoch);
         if (topology == null)
-            MessagingService.instance().respond(Response.UNKNOWN, message);
+            MessagingService.instance().respond(Response.unkonwn(epoch), message);
         else
             MessagingService.instance().respond(new Response(epoch, topology), message);
     };
 
+    private static final Logger logger = LoggerFactory.getLogger(FetchTopology.class);
+
     public static Future<Topology> fetch(SharedContext context, Collection<InetAddressAndPort> peers, long epoch)
     {
         FetchTopology req = new FetchTopology(epoch);
-        return context.messaging().<FetchTopology, Response>sendWithRetries(Verb.ACCORD_FETCH_TOPOLOGY_REQ, req, MessagingUtils.tryAliveFirst(SharedContext.Global.instance, peers),
-                                                                                          // If the epoch is already discovered, no need to retry
-                                                                                          (attempt, from, failure) -> AccordService.instance().currentEpoch() < epoch,
-                                                                                          MessageDelivery.RetryErrorMessage.EMPTY)
+        return context.messaging().<FetchTopology, Response>sendWithRetries(Verb.ACCORD_FETCH_TOPOLOGY_REQ, req,
+                                                                            MessagingUtils.tryAliveFirst(SharedContext.Global.instance, peers),

Review Comment:
   Is it intentional that this will loop forever? The first pass only tries the alive nodes; then, once all peers have been tried, we reset the iterator and try again.
   
   I feel that it's unsafe to use `Backoff.NO_OP.INSTANCE` here: we are blocking `org.apache.cassandra.service.accord.AccordConfigurationService#fetchTopologyInternal` for an unbounded amount of time and don't have any way to signal that things are going wrong.
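   
   To make the concern concrete, here is a rough sketch of a bounded alternative in plain Java. It deliberately does not use the Cassandra messaging API: `BoundedRetry`, `fetchWithBound`, and `RetriesExhausted` are hypothetical names, the blocking `Thread.sleep` stands in for whatever asynchronous backoff the real code would use, and a `null` result is assumed to model an "unknown epoch" response. The point is only that the peer list may be cycled repeatedly while an attempt budget guarantees the caller eventually hears about failure.

```java
import java.util.List;
import java.util.function.Function;

public class BoundedRetry
{
    /** Thrown once the attempt budget is spent, so the caller is notified instead of blocking forever. */
    public static class RetriesExhausted extends RuntimeException
    {
        public RetriesExhausted(String message, Throwable cause)
        {
            super(message, cause);
        }
    }

    /**
     * Tries peers in order, wrapping around to the first peer after the last one
     * (similar to resetting the iterator), but gives up after maxAttempts requests.
     * Sleeps with exponential backoff after each full pass over the peers.
     */
    public static <P, R> R fetchWithBound(List<P> peers, Function<P, R> request,
                                          int maxAttempts, long baseDelayMillis, long maxDelayMillis)
    {
        if (peers.isEmpty())
            throw new IllegalArgumentException("no peers to query");

        RuntimeException lastFailure = null;
        long delay = baseDelayMillis;
        for (int attempt = 0; attempt < maxAttempts; attempt++)
        {
            P peer = peers.get(attempt % peers.size());
            try
            {
                R result = request.apply(peer);
                if (result != null)
                    return result; // assumption: null models "peer does not know the epoch"
            }
            catch (RuntimeException e)
            {
                lastFailure = e; // remember the failure and move on to the next peer
            }

            boolean finishedPass = (attempt + 1) % peers.size() == 0;
            if (finishedPass && attempt + 1 < maxAttempts)
            {
                sleep(delay);
                delay = Math.min(delay * 2, maxDelayMillis);
            }
        }
        throw new RetriesExhausted("gave up after " + maxAttempts + " attempts", lastFailure);
    }

    private static void sleep(long millis)
    {
        try
        {
            Thread.sleep(millis);
        }
        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            throw new RetriesExhausted("interrupted while backing off", null);
        }
    }
}
```

   With something along these lines, the caller could log or propagate `RetriesExhausted` rather than waiting indefinitely; the actual budget and backoff values would need to be chosen for the cluster.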



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: pr-unsubscr...@cassandra.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

