RexXiong commented on code in PR #3099:
URL: https://github.com/apache/celeborn/pull/3099#discussion_r1961286032
##########
common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala:
##########
@@ -3740,6 +3741,15 @@ object CelebornConf extends Logging {
.timeConf(TimeUnit.MILLISECONDS)
.createWithDefaultString("10s")
+ val WORKER_RESUME_BY_PINNED_MEMORY_KEEP_TIME: ConfigEntry[Long] =
+ buildConf("celeborn.worker.monitor.resumeByPinnedMemory.keepTime")
Review Comment:
> It's worth considering, but there are some problems.
> `keepResumeByPinnedMemory` will continue to resume the channel even if
> memory usage is now above `pausePushDataThreshold`. This behavior must not
> last too long. The default value of
> `celeborn.worker.monitor.pinnedMemory.resume.interval` is 10 seconds, which
> may cause an OOM risk. So I modified the code to add a check:
> `keepResumeByPinnedMemory` now requires that `pinnedMemory` is less than
> `pinnedMemoryResumeRatio`, which will greatly reduce the probability of OOM.
> @RexXiong WDYT?

Makes sense. Thank you!
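
For readers following the thread, the check described in the quoted comment could be sketched roughly as below. This is only an illustration, not the code from the PR: `MemoryState`, `Thresholds`, `shouldPause`, and the field names are hypothetical, and expressing every threshold as a ratio is an assumption made to keep the example small.

```scala
// Hypothetical sketch, NOT the PR's implementation. All names and the
// ratio-based thresholds are assumptions made for illustration.
object ResumeCheckSketch {
  final case class MemoryState(
      usedRatio: Double,            // fraction of memory currently in use
      pinnedRatio: Double,          // fraction of memory that is pinned
      msSinceResumeByPinned: Long)  // time since the resume triggered by low pinned memory

  final case class Thresholds(
      pausePushDataRatio: Double,      // stands in for pausePushDataThreshold
      pinnedMemoryResumeRatio: Double, // pinned memory must stay below this
      keepTimeMs: Long)                // stands in for resumeByPinnedMemory.keepTime

  // The channel stays resumed only while the keep time has not expired
  // AND pinned memory is still below the resume ratio (the added check).
  def keepResumeByPinnedMemory(s: MemoryState, t: Thresholds): Boolean =
    s.msSinceResumeByPinned < t.keepTimeMs &&
      s.pinnedRatio < t.pinnedMemoryResumeRatio

  // Pause when usage exceeds the pause threshold, unless the keep rule applies.
  def shouldPause(s: MemoryState, t: Thresholds): Boolean =
    s.usedRatio > t.pausePushDataRatio && !keepResumeByPinnedMemory(s, t)
}
```

With the extra condition, a worker above the pause threshold stops being kept resumed as soon as pinned memory climbs back to the resume ratio, rather than waiting for the keep time to run out.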