agrawaldevesh commented on a change in pull request #29211:
URL: https://github.com/apache/spark/pull/29211#discussion_r463860185



##########
File path: 
core/src/main/scala/org/apache/spark/storage/BlockManagerDecommissioner.scala
##########
@@ -327,4 +354,28 @@ private[storage] class BlockManagerDecommissioner(
     }
     logInfo("Stopped storage decommissioner")
   }
+
+  /*
+   *  Returns the last migration time and a boolean for if all blocks have 
been migrated.
+   *  If there are any tasks running since that time the boolean may be 
incorrect.
+   */
+  private[storage] def lastMigrationInfo(): (Long, Boolean) = {
+    if (stopped || (stoppedRDD && stoppedShuffle)) {
+      (System.nanoTime(), true)
+    } else {
+      // Chose the min of the running times.
+      val lastMigrationTime = if (
+        conf.get(config.STORAGE_DECOMMISSION_SHUFFLE_BLOCKS_ENABLED) &&
+        conf.get(config.STORAGE_DECOMMISSION_RDD_BLOCKS_ENABLED)) {
+        Math.min(lastRDDMigrationTime, lastShuffleMigrationTime)

Review comment:
       > @agrawaldevesh: what is your opinion/idea?
   
   I confess that even I am finding the logic in the lastMigrationInfo a bit 
hard to follow. And as far as I can see, this logic is the key to getting the 
shutdown thread to exit cleanly. If the information returned by 
lastMigrationInfo is wrong then the shutdown thread may exit prematurely or 
never. Both of which should be avoided, particularly because the shutdown 
thread has no timeout on the amount of time it can hang around. 
   
   This complexity is intrinsic. The logic is indeed complex but I think what 
we can add some more documentation here, perhaps explaining the intent of the 
code using a couple of examples ? Even better would be to please add a unit 
test for this. I think it should be easy to unit test just the 
lastMigrationInfo. This is just testing that the function is behaving as 
expected by stubbing out other parts. This test would also help with providing 
the necessary guardrails as we change this function in the future. I expect 
this function to change as we production harden the cache/shuffle block 
migration in the near term.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to