Re: [PR] [CELEBORN-1146] Improve the algorithm for available workers [incubator-celeborn]

via GitHub Sun, 03 Dec 2023 19:51:17 -0800


mridulm commented on code in PR #2119:
URL: 
https://github.com/apache/incubator-celeborn/pull/2119#discussion_r1413334606



##########
master/src/main/scala/org/apache/celeborn/service/deploy/master/Master.scala:
##########
@@ -666,23 +666,32 @@ private[celeborn] class Master(
     val shuffleKey = Utils.makeShuffleKey(requestSlots.applicationId, 
requestSlots.shuffleId)
 
     val availableWorkers = workersAvailable()
+    // reply false if all workers are unavailable
+    if (availableWorkers.isEmpty) {
+      logError(
+        s"Non available workers, offer slots for $numReducers reducers of 
$shuffleKey failed!")
+      context.reply(RequestSlotsResponse(StatusCode.SLOT_NOT_AVAILABLE, new 
WorkerResource()))
+      return
+    }
+
     val numAvailableWorkers = availableWorkers.size()
     val numWorkers = Math.min(
       Math.max(
         if (requestSlots.shouldReplicate) 2 else 1,
         if (requestSlots.maxWorkers <= 0) slotsAssignMaxWorkers
         else Math.min(slotsAssignMaxWorkers, requestSlots.maxWorkers)),
       numAvailableWorkers)
+
+    // We treated availableWorkers as a Circular Queue here.
     val startIndex = Random.nextInt(numAvailableWorkers)
+    val endIndex = startIndex + numWorkers
+
     val selectedWorkers = new util.ArrayList[WorkerInfo](numWorkers)
-    selectedWorkers.addAll(availableWorkers.subList(
-      startIndex,
-      Math.min(numAvailableWorkers, startIndex + numWorkers)))
-    if (startIndex + numWorkers > numAvailableWorkers) {
-      selectedWorkers.addAll(availableWorkers.subList(
-        0,
-        startIndex + numWorkers - numAvailableWorkers))
+    for (index <- startIndex until endIndex) {
+      val realIndex = index % numAvailableWorkers
+      selectedWorkers.add(availableWorkers.get(realIndex))
     }
+

Review Comment:
   >  so that the very small performance gap is acceptable from my side, and 
the focus of this patch is to improve readability.
   
   While readability and maintainability are very important, there is nothing 
inherently unmaintainable about using `subList`.
   IMO performance is also important - atleast for our deployment, where we are 
running millions of shuffles a day for PB's of data, it is :-)
   
   > we can see that there is a similar algorithm in the 
`Master#handleRequestSlots`
   
   There the subList implementation would not work - it is inserting a tuple 
into slots map based on the worker, while here it is a bulk addition.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [CELEBORN-1146] Improve the algorithm for available workers [incubator-celeborn]

Reply via email to