This is an automated email from the ASF dual-hosted git repository.
zhouky pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/incubator-celeborn.git
The following commit(s) were added to refs/heads/main by this push:
new bee864842 [CELEBORN-864][DOC] Document on blacklist
bee864842 is described below
commit bee8648421a92c30541e7e9cb70a190b9dc4cc22
Author: zky.zhoukeyong <[email protected]>
AuthorDate: Tue Aug 1 21:23:55 2023 +0800
[CELEBORN-864][DOC] Document on blacklist
### What changes were proposed in this pull request?
As title.
### Why are the changes needed?
As title.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual test.
Closes #1782 from waitinfuture/864.
Lead-authored-by: zky.zhoukeyong <[email protected]>
Co-authored-by: Keyong Zhou <[email protected]>
Signed-off-by: zky.zhoukeyong <[email protected]>
---
docs/developers/workerexclusion.md | 81 ++++++++++++++++++++++++++++++++++++++
mkdocs.yml | 3 +-
2 files changed, 82 insertions(+), 2 deletions(-)
diff --git a/docs/developers/workerexclusion.md b/docs/developers/workerexclusion.md
new file mode 100644
index 000000000..44f7db49d
--- /dev/null
+++ b/docs/developers/workerexclusion.md
@@ -0,0 +1,81 @@
+---
+license: |
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+
+ https://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+---
+
+# Overview
+`Worker`s can fail, temporarily or permanently. To reduce the impact of `Worker` failure, Celeborn tries to
+determine `Worker` status as quickly and as accurately as possible. This article describes the detailed
+design of `Worker` exclusion.
+
+## Participants
+As described [previously](../../developers/overview#components), Celeborn has three components: `Master`, `Worker`,
+and `Client`. `Client` is further separated into `LifecycleManager` and `ShuffleClient`. `Master`, `LifecycleManager`,
+and `ShuffleClient` all need to know about `Worker` status, actively or reactively.
+
+## Master Side Exclusion
+`Master` maintains the ground-truth status of `Worker`s, but with a relatively long delay. `Master` maintains four
+lists of `Worker`s with different statuses:
+
+- Active list. `Worker`s that have successfully registered with `Master` and whose heartbeats have never timed out.
+- Excluded list. `Worker`s that are in the active list, but have no available disks for allocating new
+  slots. `Master` recognizes such `Worker`s through the heartbeats they send.
+- Graceful shutdown list. `Worker`s that are in the active list, but have triggered
+  [Graceful Shutdown](../../upgrading). `Master` expects these `Worker`s to re-register themselves soon.
+- Lost list. `Worker`s whose heartbeats have timed out. These `Worker`s are removed from the active and excluded
+  lists, but not from the graceful shutdown list.
+
+Upon receiving RequestSlots, `Master` chooses `Worker`s from the active list minus the excluded and graceful
+shutdown lists. Since `Master` only excludes `Worker`s upon heartbeat, its view has a relatively long delay.
+
+## ShuffleClient Side Exclusion
+`ShuffleClient`'s local exclusion list is essential to performance. Say the timeout to create a network
+connection is 10s: if `ShuffleClient` blindly pushes data to a non-existent `Worker`, every attempt waits out
+that timeout and the task can hang for a long time.
+
+Waiting for `Master` to propagate its exclusion list is unacceptable because of the delay. Instead, `ShuffleClient`
+actively excludes `Worker`s when it encounters critical exceptions, for example:
+
+- Failure to create a network connection
+- Failure to push data
+- Failure to fetch data
+- A connection exception
+
+In addition to excluding the `Worker`s locally, `ShuffleClient` also carries the cause of the push failure with
+[Revive](../../developers/faulttolerant#handle-pushdata-failure) so that `LifecycleManager` can exclude the `Worker`s too;
+see the LifecycleManager Side Exclusion section below.
+
+This strategy is aggressive, so false positives may happen: a healthy `Worker` may be excluded. To rectify this,
+`ShuffleClient` removes `Worker`s from the excluded list whenever an event indicates that the `Worker` is
+available, for example:
+
+- When the `Worker` is allocated slots in register shuffle
+- When `LifecycleManager` says the `Worker` is available in its response to Revive
+
+Currently, the exclusion list in `ShuffleClient` is optional. Users can enable it separately for push and fetch
+with the following configs:
+
+- `celeborn.client.push.excludeWorkerOnFailure.enabled`
+- `celeborn.client.fetch.excludeWorkerOnFailure.enabled`
+
+## LifecycleManager Side Exclusion
+The accuracy and delay of `LifecycleManager`'s exclusion list fall between those of `Master` and `ShuffleClient`.
+`LifecycleManager` excludes a `Worker` in the following scenarios:
+
+- It receives a Revive request whose cause is critical
+- It fails to send an RPC to the `Worker`
+- The `Worker` is in `Master`'s excluded list, carried in the heartbeat response
+
+`LifecycleManager` removes a `Worker` from the excluded list in the following scenarios:
+
+- For critical causes, when the timeout expires (defaults to 180s)
+- For non-critical causes, when the `Worker` is no longer in `Master`'s exclusion list
\ No newline at end of file
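
The Master-side bookkeeping described in `workerexclusion.md` above can be sketched as plain set arithmetic. This is only an illustrative model written for this note, not code from Celeborn: the class `MasterWorkerLists` and its method names are hypothetical, and workers are represented by plain strings.

```python
from dataclasses import dataclass, field


@dataclass
class MasterWorkerLists:
    """Hypothetical model of the four lists; not Celeborn's actual classes."""

    active: set = field(default_factory=set)
    excluded: set = field(default_factory=set)
    graceful_shutdown: set = field(default_factory=set)
    lost: set = field(default_factory=set)

    def on_heartbeat(self, worker: str, has_available_disks: bool) -> None:
        # Exclusion state is only refreshed when a heartbeat arrives,
        # which is why the Master's view lags behind reality.
        if worker not in self.active:
            return
        if has_available_disks:
            self.excluded.discard(worker)
        else:
            self.excluded.add(worker)

    def on_heartbeat_timeout(self, worker: str) -> None:
        # A lost worker leaves the active and excluded lists,
        # but stays in the graceful shutdown list if it was there.
        self.active.discard(worker)
        self.excluded.discard(worker)
        self.lost.add(worker)

    def candidates_for_slots(self) -> set:
        # Active list minus excluded list minus graceful shutdown list.
        return self.active - self.excluded - self.graceful_shutdown


lists = MasterWorkerLists(active={"w1", "w2", "w3", "w4"})
lists.on_heartbeat("w2", has_available_disks=False)  # w2 has no usable disks
lists.graceful_shutdown.add("w3")                    # w3 is restarting
lists.on_heartbeat_timeout("w4")                     # w4 stopped heartbeating
print(sorted(lists.candidates_for_slots()))  # ['w1']
```

Only `w1` remains eligible for new slots. Once `w2` reports available disks again in a later heartbeat, it returns to the candidate set.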
diff --git a/mkdocs.yml b/mkdocs.yml
index ec14dee15..43640a777 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -93,5 +93,4 @@ nav:
- LifecycleManager: developers/lifecyclemanager.md
- ShuffleClient: developers/shuffleclient.md
- Fault Tolerant: developers/faulttolerant.md
-# - ReadData: developers/readdata.md
-# - Slots Allocation: developers/slotsallocation.md
+ - Worker Exclusion: developers/workerexclusion.md
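
The `LifecycleManager` rules in the new `workerexclusion.md` (exclude on critical causes or on `Master`'s say-so, then remove on timeout or when `Master` no longer lists the `Worker`) can be sketched as follows. This is a minimal illustration under stated assumptions, not Celeborn's implementation: the class and method names are hypothetical, the clock is injectable so the 180s default can be exercised without waiting, and treating the timeout as inclusive (`>=`) is an assumption.

```python
import time
from typing import Callable, Dict, Set, Tuple

CRITICAL_EXPIRE_SECONDS = 180.0  # default timeout stated in the document


class LifecycleManagerExclusion:
    """Hypothetical sketch; names do not come from the Celeborn code base."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # worker -> (is_critical, time at which the worker was excluded)
        self._excluded: Dict[str, Tuple[bool, float]] = {}

    def exclude_critical(self, worker: str) -> None:
        # Called for a Revive with a critical cause or a failed RPC.
        self._excluded[worker] = (True, self._clock())

    def on_master_heartbeat(self, master_excluded: Set[str]) -> None:
        now = self._clock()
        # Adopt workers from Master's excluded list as non-critical entries,
        # without overwriting an existing (possibly critical) entry.
        for worker in master_excluded:
            self._excluded.setdefault(worker, (False, now))
        # Drop entries whose removal condition is met.
        for worker, (critical, since) in list(self._excluded.items()):
            if critical:
                expired = now - since >= CRITICAL_EXPIRE_SECONDS
            else:
                expired = worker not in master_excluded
            if expired:
                del self._excluded[worker]

    def excluded(self) -> Set[str]:
        return set(self._excluded)


now = [0.0]  # fake clock, in seconds
lm = LifecycleManagerExclusion(clock=lambda: now[0])
lm.exclude_critical("w1")
lm.on_master_heartbeat({"w2"})
print(sorted(lm.excluded()))  # ['w1', 'w2']

now[0] = 200.0  # past the 180s timeout
lm.on_master_heartbeat(set())
print(sorted(lm.excluded()))  # []
```

Keeping the cause alongside each entry is what lets the two removal rules coexist: a critical entry ignores `Master`'s list and only ages out, while a non-critical entry simply mirrors `Master`'s view.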