Re: [PR] [CELEBORN-1146] Improve the algorithm for available workers [incubator-celeborn]

via GitHub Sat, 02 Dec 2023 10:40:49 -0800


mridulm commented on code in PR #2119:
URL: 
https://github.com/apache/incubator-celeborn/pull/2119#discussion_r1412837658



##########
master/src/main/scala/org/apache/celeborn/service/deploy/master/Master.scala:
##########
@@ -666,23 +666,32 @@ private[celeborn] class Master(
     val shuffleKey = Utils.makeShuffleKey(requestSlots.applicationId, requestSlots.shuffleId)
 
     val availableWorkers = workersAvailable()
+    // reply false if all workers are unavailable
+    if (availableWorkers.isEmpty) {
+      logError(
+        s"Non available workers, offer slots for $numReducers reducers of $shuffleKey failed!")
+      context.reply(RequestSlotsResponse(StatusCode.SLOT_NOT_AVAILABLE, new WorkerResource()))
+      return
+    }
+
     val numAvailableWorkers = availableWorkers.size()
     val numWorkers = Math.min(
       Math.max(
         if (requestSlots.shouldReplicate) 2 else 1,
         if (requestSlots.maxWorkers <= 0) slotsAssignMaxWorkers
         else Math.min(slotsAssignMaxWorkers, requestSlots.maxWorkers)),
       numAvailableWorkers)
+
+    // We treated availableWorkers as a Circular Queue here.
     val startIndex = Random.nextInt(numAvailableWorkers)
+    val endIndex = startIndex + numWorkers
+
     val selectedWorkers = new util.ArrayList[WorkerInfo](numWorkers)
-    selectedWorkers.addAll(availableWorkers.subList(
-      startIndex,
-      Math.min(numAvailableWorkers, startIndex + numWorkers)))
-    if (startIndex + numWorkers > numAvailableWorkers) {
-      selectedWorkers.addAll(availableWorkers.subList(
-        0,
-        startIndex + numWorkers - numAvailableWorkers))
+    for (index <- startIndex until endIndex) {
+      val realIndex = index % numAvailableWorkers
+      selectedWorkers.add(availableWorkers.get(realIndex))
     }
+

Review Comment:
   From a complexity point of view, this is counterintuitive: given that the input is 
a `LinkedList` in this example, the `get` version is O(n^2), while 
the `subList` version is O(n).
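
   To make the comparison concrete, here is a minimal Java sketch of the two selection strategies. The class and method names are hypothetical and are not Celeborn code; it assumes `0 <= start < workers.size()` and `count <= workers.size()`, which mirrors how `startIndex` and `numWorkers` are bounded in the diff. Both methods return the same elements; they differ only in how they traverse the list.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Hypothetical illustration only; names are not taken from the Celeborn code base.
class CircularSelect {

    // Index-based selection, mirroring the added loop in the diff.
    // On a LinkedList every get(i) has to walk the list to reach index i,
    // so selecting k elements out of n costs O(n * k) in the worst case.
    static <T> List<T> selectByGet(List<T> workers, int start, int count) {
        List<T> selected = new ArrayList<>(count);
        for (int i = start; i < start + count; i++) {
            selected.add(workers.get(i % workers.size()));
        }
        return selected;
    }

    // subList-based selection, mirroring the removed code in the diff.
    // Each subList is positioned once and then read sequentially,
    // so the whole selection is O(n) even on a LinkedList.
    static <T> List<T> selectBySubList(List<T> workers, int start, int count) {
        int n = workers.size();
        List<T> selected = new ArrayList<>(count);
        selected.addAll(workers.subList(start, Math.min(n, start + count)));
        if (start + count > n) {
            // Wrap around to the head of the list for the remaining elements.
            selected.addAll(workers.subList(0, start + count - n));
        }
        return selected;
    }

    public static void main(String[] args) {
        List<Integer> workers = new LinkedList<>();
        for (int i = 0; i < 5; i++) {
            workers.add(i);
        }
        System.out.println(selectByGet(workers, 3, 4));     // [3, 4, 0, 1]
        System.out.println(selectBySubList(workers, 3, 4)); // [3, 4, 0, 1]
    }
}
```

   If `availableWorkers` were a random-access list such as `ArrayList`, `get` would be O(1) and both versions would be linear, so the gap only matters for sequential-access implementations.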
   
   Not sure how this was benchmarked, but for microbenchmarks I would 
suggest using something like JMH.
   If the benchmark code is what is shared above, then in this specific example I 
would assume it is overwhelmed by noise and precision issues. The result needs to 
be statistically significant, and the precision of measurement should have been 
in ns.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
