[PR] Limit concurrent goroutines in cluster node probing [kvrocks-controller]

via GitHub Tue, 18 Mar 2025 03:34:19 -0700


Jitmisra opened a new pull request, #280:
URL: https://github.com/apache/kvrocks-controller/pull/280


   In the current implementation of parallelProbeNodes(), a new goroutine is 
spawned for every node in the cluster without any limit on concurrent 
operations. This can lead to resource exhaustion in large deployments with many 
nodes.
   
   `func (c *ClusterChecker) parallelProbeNodes(ctx context.Context, cluster 
*store.Cluster) {
       for i, shard := range cluster.Shards {
           for _, node := range shard.Nodes {
               go func(shardIdx int, n store.Node) {
                   // Node probing code...
               }(i, node)
           }
       }
       // No waiting mechanism or concurrency control
   }`
   
   Problems
   Resource Exhaustion: In large clusters (hundreds or thousands of nodes), 
this creates too many goroutines at once
   Network Floods: All probes happen simultaneously, potentially overwhelming 
network connections
   No Completion Guarantee: The function returns immediately without ensuring 
all probes complete
   System Resource Strain: Can exhaust system resources like file descriptors 
for network connections
   Solution
   This PR implements a semaphore pattern to limit the number of concurrent 
operations while still maintaining parallelism:
   
   This change makes the system more resilient and scalable for production 
deployments of all sizes.
   This PR implements a semaphore pattern to limit the number of concurrent 
operations while still maintaining parallelism:
   
   `func (c *ClusterChecker) parallelProbeNodes(ctx context.Context, cluster 
*store.Cluster) {
       // Limit concurrent operations to a reasonable number
       semaphore := make(chan struct{}, 20) // Configurable limit
       var wg sync.WaitGroup
   
       for i, shard := range cluster.Shards {
           for _, node := range shard.Nodes {
               wg.Add(1)
               go func(shardIdx int, n store.Node) {
                   defer wg.Done()
                   
                   // Acquire semaphore before proceeding
                   semaphore <- struct{}{}
                   defer func() { <-semaphore }()
                   
                   // Existing node probing code...
               }(i, node)
           }
       }
   
       // Wait for all probes to complete
       wg.Wait()
   }`
   
   This change makes the system more resilient and scalable for production 
deployments of all sizes.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[PR] Limit concurrent goroutines in cluster node probing [kvrocks-controller]

Reply via email to