This is an automated email from the ASF dual-hosted git repository.
zhouky pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/incubator-celeborn.git
The following commit(s) were added to refs/heads/main by this push:
new 9ec223edd [CELEBORN-803] Increase default timeout for commit files
9ec223edd is described below
commit 9ec223edd7a99c71d74a0f9c64595219e8359d20
Author: zky.zhoukeyong <[email protected]>
AuthorDate: Mon Jul 17 22:31:36 2023 +0800
[CELEBORN-803] Increase default timeout for commit files
### What changes were proposed in this pull request?
As title.
### Why are the changes needed?
In 0.2.1-incubating, commit files default timeout is ```NETWORK_TIMEOUT```,
which is 240s.
It's more reasonable because commit files costs relatively long time. In my
testing with tough disks,
30s timeout with 2 retires is not enough.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Passes GA and manual test.
Closes #1724 from waitinfuture/803.
Authored-by: zky.zhoukeyong <[email protected]>
Signed-off-by: zky.zhoukeyong <[email protected]>
---
common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala | 5 +++--
docs/configuration/client.md | 2 +-
docs/configuration/worker.md | 2 +-
3 files changed, 5 insertions(+), 4 deletions(-)
diff --git
a/common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala
b/common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala
index f9dfba1d6..bedf5f3b4 100644
--- a/common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala
+++ b/common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala
@@ -2147,7 +2147,8 @@ object CelebornConf extends Logging {
.doc("Timeout for a Celeborn worker to commit files of a shuffle. " +
"It's recommended to set at least `240s` when `HDFS` is enabled in
`celeborn.storage.activeTypes`.")
.version("0.3.0")
- .fallbackConf(RPC_ASK_TIMEOUT)
+ .timeConf(TimeUnit.MILLISECONDS)
+ .createWithDefaultString("120s")
val PARTITION_SORTER_SORT_TIMEOUT: ConfigEntry[Long] =
buildConf("celeborn.worker.sortPartition.timeout")
@@ -3190,7 +3191,7 @@ object CelebornConf extends Logging {
.version("0.3.0")
.intConf
.checkValue(v => v > 0, "value must be positive")
- .createWithDefault(2)
+ .createWithDefault(3)
val CLIENT_COMMIT_IGNORE_EXCLUDED_WORKERS: ConfigEntry[Boolean] =
buildConf("celeborn.client.commitFiles.ignoreExcludedWorker")
diff --git a/docs/configuration/client.md b/docs/configuration/client.md
index c15138700..2a92c35e3 100644
--- a/docs/configuration/client.md
+++ b/docs/configuration/client.md
@@ -62,7 +62,7 @@ license: |
| celeborn.client.push.timeout | 120s | Timeout for a task to push data rpc
message. This value should better be more than twice of
`celeborn.<module>.push.timeoutCheck.interval` | 0.3.0 |
| celeborn.client.registerShuffle.maxRetries | 3 | Max retry times for client
to register shuffle. | 0.3.0 |
| celeborn.client.registerShuffle.retryWait | 3s | Wait time before next retry
if register shuffle failed. | 0.3.0 |
-| celeborn.client.requestCommitFiles.maxRetries | 2 | Max retry times for
requestCommitFiles RPC. | 0.3.0 |
+| celeborn.client.requestCommitFiles.maxRetries | 3 | Max retry times for
requestCommitFiles RPC. | 0.3.0 |
| celeborn.client.reserveSlots.maxRetries | 3 | Max retry times for client to
reserve slots. | 0.3.0 |
| celeborn.client.reserveSlots.rackware.enabled | false | Whether need to
place different replicates on different racks when allocating slots. | 0.3.0 |
| celeborn.client.reserveSlots.retryWait | 3s | Wait time before next retry if
reserve slots failed. | 0.3.0 |
diff --git a/docs/configuration/worker.md b/docs/configuration/worker.md
index 372f91cf4..7d37707f2 100644
--- a/docs/configuration/worker.md
+++ b/docs/configuration/worker.md
@@ -27,7 +27,7 @@ license: |
| celeborn.worker.bufferStream.threadsPerMountpoint | 8 | Threads count for
read buffer per mount point. | 0.3.0 |
| celeborn.worker.closeIdleConnections | false | Whether worker will close
idle connections. | 0.2.0 |
| celeborn.worker.commitFiles.threads | 32 | Thread number of worker to commit
shuffle data files asynchronously. It's recommended to set at least `128` when
`HDFS` is enabled in `celeborn.storage.activeTypes`. | 0.3.0 |
-| celeborn.worker.commitFiles.timeout | <value of
celeborn.rpc.askTimeout> | Timeout for a Celeborn worker to commit files of
a shuffle. It's recommended to set at least `240s` when `HDFS` is enabled in
`celeborn.storage.activeTypes`. | 0.3.0 |
+| celeborn.worker.commitFiles.timeout | 120s | Timeout for a Celeborn worker
to commit files of a shuffle. It's recommended to set at least `240s` when
`HDFS` is enabled in `celeborn.storage.activeTypes`. | 0.3.0 |
| celeborn.worker.congestionControl.enabled | false | Whether to enable
congestion control or not. | 0.3.0 |
| celeborn.worker.congestionControl.high.watermark | <undefined> | If
the total bytes in disk buffer exceeds this configure, will start to
congestusers whose produce rate is higher than the potential average consume
rate. The congestion will stop if the produce rate is lower or equal to the
average consume rate, or the total pending bytes lower than
celeborn.worker.congestionControl.low.watermark | 0.3.0 |
| celeborn.worker.congestionControl.low.watermark | <undefined> | Will
stop congest users if the total pending bytes of disk buffer is lower than this
configuration | 0.3.0 |