This is an automated email from the ASF dual-hosted git repository. nicholasjiang pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/celeborn.git
commit 7244f181e1ebc3fb2633842c6cc5e1873a7c58c5
Author: SteNicholas <[email protected]>
AuthorDate: Tue Nov 4 15:23:06 2025 +0800

    [CELEBORN-2194] Change default value of celeborn.worker.directMemoryRatioForReadBuffer

    ### What changes were proposed in this pull request?

    Change the default value of `celeborn.worker.directMemoryRatioForReadBuffer` from 0.1 to 0.35.

    ### Why are the changes needed?

    The default value of `celeborn.worker.directMemoryRatioForReadBuffer` is 0.1, which is small enough to cause a backlog of read buffer requests in `ReadBufferDispatcher`. Therefore, the default should be changed from `0.1` to `0.35`, a value taken from production practice, to raise the read buffer threshold of `ReadBufferDispatcher`.

    ### Does this PR resolve a correctness bug?

    No.

    ### Does this PR introduce _any_ user-facing change?

    The default value of `celeborn.worker.directMemoryRatioForReadBuffer` is changed to 0.35.

    ### How was this patch tested?

    CI.

    Closes #3527 from SteNicholas/CELEBORN-2194.

    Authored-by: SteNicholas <[email protected]>
    Signed-off-by: SteNicholas <[email protected]>
---
 common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala | 2 +-
 docs/configuration/worker.md                                        | 2 +-
 docs/migration.md                                                   | 2 ++
 3 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala b/common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala
index 010cd57a4..a39cf5797 100644
--- a/common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala
+++ b/common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala
@@ -4153,7 +4153,7 @@ object CelebornConf extends Logging {
       .doc("Max ratio of direct memory for read buffer")
       .version("0.2.0")
       .doubleConf
-      .createWithDefault(0.1)
+      .createWithDefault(0.35)
 
   val WORKER_DIRECT_MEMORY_RATIO_FOR_MEMORY_FILE_STORAGE: ConfigEntry[Double] =
     buildConf("celeborn.worker.directMemoryRatioForMemoryFileStorage")
diff --git a/docs/configuration/worker.md b/docs/configuration/worker.md
index f4f4744d4..66468c749 100644
--- a/docs/configuration/worker.md
+++ b/docs/configuration/worker.md
@@ -77,7 +77,7 @@ license: |
 | celeborn.worker.decommission.checkInterval | 30s | false | The wait interval of checking whether all the shuffle expired during worker decommission | 0.4.0 |  | 
 | celeborn.worker.decommission.forceExitTimeout | 6h | false | The wait time of waiting for all the shuffle expire during worker decommission. | 0.4.0 |  | 
 | celeborn.worker.directMemoryRatioForMemoryFileStorage | 0.0 | false | Max ratio of direct memory to store shuffle data. This feature is experimental and disabled by default. | 0.5.0 |  | 
-| celeborn.worker.directMemoryRatioForReadBuffer | 0.1 | false | Max ratio of direct memory for read buffer | 0.2.0 |  | 
+| celeborn.worker.directMemoryRatioForReadBuffer | 0.35 | false | Max ratio of direct memory for read buffer | 0.2.0 |  | 
 | celeborn.worker.directMemoryRatioToPauseReceive | 0.85 | false | If direct memory usage reaches this limit, the worker will stop to receive data from Celeborn shuffle clients. | 0.2.0 |  | 
 | celeborn.worker.directMemoryRatioToPauseReplicate | 0.95 | false | If direct memory usage reaches this limit, the worker will stop to receive replication data from other workers. This value should be higher than celeborn.worker.directMemoryRatioToPauseReceive. | 0.2.0 |  | 
 | celeborn.worker.directMemoryRatioToResume | 0.7 | false | If direct memory usage is less than this limit, worker will resume. | 0.2.0 |  | 
diff --git a/docs/migration.md b/docs/migration.md
index 2c302a534..7e0385f36 100644
--- a/docs/migration.md
+++ b/docs/migration.md
@@ -31,6 +31,8 @@ license: |
 
 - Since 0.7.0, Celeborn changed the default value of `celeborn.<module>.io.mode` from `NIO` to `KQUEUE` if kqueue mode is available, falling back to `NIO` otherwise.
 
+- Since 0.7.0, Celeborn changed the default value of `celeborn.worker.directMemoryRatioForReadBuffer` from `0.1` to `0.35`, which means read buffer threshold of buffer dispatcher is max direct memory * 0.35 at default.
+
 # Upgrading from 0.5 to 0.6
 
 - Since 0.6.0, Celeborn deprecate `celeborn.client.spark.fetch.throwsFetchFailure`. Please use `celeborn.client.spark.stageRerun.enabled` instead.
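
The migration note describes the read buffer threshold as max direct memory * 0.35. The arithmetic can be sketched in Scala as below. This is an illustration only and not Celeborn's code: the object name `ReadBufferThresholdSketch`, the helper `readBufferThreshold`, the `[0, 1]` range check, the truncation to a whole number of bytes, and the 4 GiB direct memory size are all assumptions made for the example.

```scala
// Hypothetical sketch; names and the 4 GiB figure are assumptions, not Celeborn APIs.
object ReadBufferThresholdSketch {

  /** Bytes of direct memory usable for read buffers: maxDirectMemory * ratio. */
  def readBufferThreshold(maxDirectMemoryBytes: Long, ratio: Double): Long = {
    // Assumed sanity check for this sketch: the ratio is a fraction of direct memory.
    require(ratio >= 0.0 && ratio <= 1.0, "ratio must be within [0, 1]")
    // Truncate to whole bytes (an assumption about rounding).
    (maxDirectMemoryBytes * ratio).toLong
  }

  def main(args: Array[String]): Unit = {
    val maxDirectMemory = 4L * 1024 * 1024 * 1024 // assumed worker direct memory: 4 GiB
    val mib = 1024L * 1024
    // Compare the old default (0.1) with the new default (0.35).
    for (ratio <- Seq(0.1, 0.35)) {
      val threshold = readBufferThreshold(maxDirectMemory, ratio)
      println(s"ratio=$ratio -> ${threshold / mib} MiB (truncated)")
    }
  }
}
```

Under these assumptions, 4 GiB of direct memory gives a threshold of 4096 MiB * 0.1 = 409.6 MiB with the old default and 4096 MiB * 0.35 = 1433.6 MiB with the new one. A deployment that depends on the previous behavior can set `celeborn.worker.directMemoryRatioForReadBuffer` to `0.1` explicitly.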
