TheodoreLx commented on code in PR #3099:
URL: https://github.com/apache/celeborn/pull/3099#discussion_r1959555924


##########
common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala:
##########
@@ -3740,6 +3741,15 @@ object CelebornConf extends Logging {
       .timeConf(TimeUnit.MILLISECONDS)
       .createWithDefaultString("10s")
 
+  val WORKER_RESUME_BY_PINNED_MEMORY_KEEP_TIME: ConfigEntry[Long] =
+    buildConf("celeborn.worker.monitor.resumeByPinnedMemory.keepTime")

Review Comment:
   It's worth considering, but there is a problem. 
`keepResumeByPinnedMemory` will keep resuming the channel even when memory 
usage is already above `pausePushDataThreshold`. 
   This state must not last too long. The default value of 
`celeborn.worker.monitor.pinnedMemory.resume.interval` is 10 seconds, which is 
long enough to create an OOM risk. 
   So I modified the code to add a check: `keepResumeByPinnedMemory` now also 
requires that `pinnedMemory` is less than `pinnedMemoryResumeRatio`, which 
should greatly reduce the probability of OOM. @RexXiong WDYT?
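
   For illustration only, the added check could look roughly like the sketch 
below. This is not the code in the PR: `ResumeCheckSketch`, `MemorySnapshot`, 
`shouldKeepResume`, and every parameter name are hypothetical, and it assumes 
that `pinnedMemoryResumeRatio` is compared against pinned memory as a fraction 
of the maximum direct memory.

```scala
// Hypothetical sketch, not the actual Celeborn implementation.
object ResumeCheckSketch {

  // Assumed shape of a memory reading; field names are illustrative.
  final case class MemorySnapshot(pinnedMemory: Long, maxMemory: Long)

  /**
   * Returns true only if both conditions hold:
   *  1. we are still inside the keep-time window that started at the last
   *     resume triggered by pinned memory, and
   *  2. pinned memory, as a fraction of max memory, is below the resume ratio.
   */
  def shouldKeepResume(
      snapshot: MemorySnapshot,
      pinnedMemoryResumeRatio: Double,
      lastResumeTimeMs: Long,
      keepTimeMs: Long,
      nowMs: Long): Boolean = {
    if (snapshot.maxMemory <= 0) {
      // Without a valid upper bound the ratio is undefined; do not keep resuming.
      false
    } else {
      val withinKeepWindow = nowMs - lastResumeTimeMs < keepTimeMs
      val pinnedRatio = snapshot.pinnedMemory.toDouble / snapshot.maxMemory
      withinKeepWindow && pinnedRatio < pinnedMemoryResumeRatio
    }
  }
}
```

   With this shape, the keep-resume state ends as soon as either the keep time 
elapses or pinned memory climbs back to the ratio, so it cannot hold the 
channel open for a full interval while pinned memory is growing.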


