This is an automated email from the ASF dual-hosted git repository.

zhouky pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/incubator-celeborn.git


The following commit(s) were added to refs/heads/main by this push:
     new bee864842 [CELEBORN-864][DOC] Document on blacklist
bee864842 is described below

commit bee8648421a92c30541e7e9cb70a190b9dc4cc22
Author: zky.zhoukeyong <[email protected]>
AuthorDate: Tue Aug 1 21:23:55 2023 +0800

    [CELEBORN-864][DOC] Document on blacklist
    
    ### What changes were proposed in this pull request?
    As title.
    
    ### Why are the changes needed?
    As title.
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Manual test.
    
    Closes #1782 from waitinfuture/864.
    
    Lead-authored-by: zky.zhoukeyong <[email protected]>
    Co-authored-by: Keyong Zhou <[email protected]>
    Signed-off-by: zky.zhoukeyong <[email protected]>
---
 docs/developers/workerexclusion.md | 81 ++++++++++++++++++++++++++++++++++++++
 mkdocs.yml                         |  3 +-
 2 files changed, 82 insertions(+), 2 deletions(-)

diff --git a/docs/developers/workerexclusion.md b/docs/developers/workerexclusion.md
new file mode 100644
index 000000000..44f7db49d
--- /dev/null
+++ b/docs/developers/workerexclusion.md
@@ -0,0 +1,81 @@
+---
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+      https://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+# Overview
+`Worker`s can fail, temporarily or permanently. To reduce the impact of `Worker` failure, Celeborn tries to
+determine `Worker` status as quickly and as accurately as possible. This article describes the detailed
+design of `Worker` exclusion.
+
+## Participants
+As described [previously](../../developers/overview#components), Celeborn has three components: `Master`, `Worker`,
+and `Client`. `Client` is further separated into `LifecycleManager` and `ShuffleClient`. `Master`, `LifecycleManager`,
+and `ShuffleClient` all need to know about `Worker` status, actively or reactively.
+
+## Master Side Exclusion
+`Master` maintains the ground-truth status of `Worker`s, but with a relatively long delay. `Master` maintains four
+lists of `Worker`s with different statuses:
+
+- Active list. `Worker`s that have successfully registered with `Master` and whose heartbeats have not timed out.
+- Excluded list. `Worker`s that are in the active list but have no available disks for allocating new
+  slots. `Master` recognizes such `Worker`s through the heartbeats they send.
+- Graceful shutdown list. `Worker`s that are in the active list but have triggered
+  [Graceful Shutdown](../../upgrading). `Master` expects these `Worker`s to re-register themselves soon.
+- Lost list. `Worker`s whose heartbeats have timed out. These `Worker`s are removed from the active and excluded
+  lists, but not from the graceful shutdown list.
+
+Upon receiving RequestSlots, `Master` chooses `Worker`s from the active list, minus those in the excluded and
+graceful shutdown lists. Since `Master` only excludes `Worker`s upon heartbeat, its view has a relatively long delay.
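
The bookkeeping described above can be sketched as a small model. This is only an illustration under stated
assumptions, not Celeborn's implementation: the class `MasterWorkerLists`, its method names, and the choice to clear
a `Worker` from the lost list on registration are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MasterWorkerLists:
    """Hypothetical model of the four Master-side lists (names are illustrative)."""
    active: set = field(default_factory=set)
    excluded: set = field(default_factory=set)
    graceful_shutdown: set = field(default_factory=set)
    lost: set = field(default_factory=set)

    def register(self, worker: str) -> None:
        # Assumption: a (re-)registered Worker is no longer considered lost.
        self.active.add(worker)
        self.lost.discard(worker)

    def on_heartbeat(self, worker: str, has_available_disks: bool) -> None:
        # Exclusion is only re-evaluated when a heartbeat arrives,
        # which is why the Master-side view lags behind reality.
        if worker not in self.active:
            return
        if has_available_disks:
            self.excluded.discard(worker)
        else:
            self.excluded.add(worker)

    def on_heartbeat_timeout(self, worker: str) -> None:
        # Lost Workers leave the active and excluded lists,
        # but stay in the graceful shutdown list.
        self.lost.add(worker)
        self.active.discard(worker)
        self.excluded.discard(worker)

    def slot_candidates(self) -> set:
        # Active list minus the excluded and graceful shutdown lists.
        return self.active - self.excluded - self.graceful_shutdown
```

In this sketch a `Worker` that reports no available disks stays in the active list but drops out of
`slot_candidates()` until a later heartbeat reports available disks again.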
+
+## ShuffleClient Side Exclusion
+`ShuffleClient`'s local exclusion list is essential to performance. Say the timeout to create a network
+connection is 10s: if `ShuffleClient` blindly pushes data to a non-existent `Worker`, every attempt blocks until
+the timeout expires, and the task appears to hang.
+
+Waiting for `Master` to publish the exclusion list is unacceptable because of the delay. Instead, `ShuffleClient`
+actively excludes `Worker`s when it encounters critical exceptions, for example:
+
+- Failure to create a network connection
+- Failure to push data
+- Failure to fetch data
+- A connection exception
+
+In addition to excluding the `Worker`s locally, `ShuffleClient` also carries the cause of push failure with
+[Revive](../../developers/faulttolerant#handle-pushdata-failure) so that `LifecycleManager` can exclude the `Worker`s
+as well; see the LifecycleManager Side Exclusion section below.
+
+This strategy is aggressive, so a healthy `Worker` may be excluded by mistake. To rectify this, `ShuffleClient`
+removes `Worker`s from the excluded list whenever an event indicates that a `Worker` is available, for example:
+
+- When the `Worker` is allocated slots during shuffle registration
+- When `LifecycleManager` reports the `Worker` as available in its response to Revive
+
+Currently, the exclusion list in `ShuffleClient` is optional. Users can enable it for push and fetch with the
+following configs:
+
+`celeborn.client.push/fetch.excludeWorkerOnFailure.enabled`
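
A minimal sketch of the client-side behaviour described in this section, assuming a simple in-memory set. The class
`LocalExclusionList`, its method names, and the two boolean flags (which stand in for the push and fetch config
switches) are hypothetical and are not Celeborn's API.

```python
class LocalExclusionList:
    """Hypothetical ShuffleClient-side exclusion list (illustrative names only)."""

    def __init__(self, push_enabled: bool = True, fetch_enabled: bool = True):
        # Assumption: each flag gates exclusion for one kind of failure.
        self.push_enabled = push_enabled
        self.fetch_enabled = fetch_enabled
        self._excluded = set()

    def on_push_failure(self, worker: str) -> None:
        # Exclude immediately instead of waiting for Master's delayed view.
        if self.push_enabled:
            self._excluded.add(worker)

    def on_fetch_failure(self, worker: str) -> None:
        if self.fetch_enabled:
            self._excluded.add(worker)

    def on_worker_available(self, worker: str) -> None:
        # Called when slots are allocated on the Worker, or when the
        # Revive response reports the Worker as available.
        self._excluded.discard(worker)

    def is_excluded(self, worker: str) -> bool:
        return worker in self._excluded
```

Because exclusion is aggressive, the removal path matters as much as the add path: in this sketch any availability
signal clears the entry at once rather than waiting for a timer.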
+
+## LifecycleManager Side Exclusion 
+The accuracy and delay of `LifecycleManager`'s exclusion list stand between those of `Master` and `ShuffleClient`.
+`LifecycleManager` excludes a `Worker` in the following scenarios:
+
+- It receives a Revive request whose cause is critical
+- It fails to send an RPC to a `Worker`
+- The `Worker` is in `Master`'s excluded list, which is carried in the heartbeat response
+
+`LifecycleManager` removes a `Worker` from the excluded list in the following scenarios:
+
+- For critical causes, when the timeout expires (defaults to 180s)
+- For non-critical causes, when the `Worker` is no longer in `Master`'s exclusion list
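
The two removal rules can be sketched as follows. This is a hedged illustration, not the actual `LifecycleManager`
code: the class `LifecycleManagerExclusion`, the `refresh` method, the injectable `clock`, and the decision to treat
exclusions learned from `Master` as non-critical are all assumptions made for the example.

```python
import time


class LifecycleManagerExclusion:
    """Hypothetical LifecycleManager-side exclusion list (illustrative names only)."""

    def __init__(self, critical_timeout_s: float = 180.0, clock=time.monotonic):
        self._timeout = critical_timeout_s
        self._clock = clock
        self._entries = {}  # worker -> (is_critical, excluded_at)

    def exclude(self, worker: str, critical: bool) -> None:
        # Used for Revive requests with a critical cause and for RPC failures.
        self._entries[worker] = (critical, self._clock())

    def refresh(self, master_excluded: set) -> None:
        """Apply the Master's excluded list from a heartbeat response."""
        now = self._clock()
        # Assumption: Workers learned from Master are recorded as non-critical.
        for worker in master_excluded:
            self._entries.setdefault(worker, (False, now))
        for worker, (critical, excluded_at) in list(self._entries.items()):
            if critical:
                # Critical causes expire after the timeout.
                if now - excluded_at >= self._timeout:
                    del self._entries[worker]
            elif worker not in master_excluded:
                # Non-critical causes follow Master's exclusion list.
                del self._entries[worker]

    def excluded(self) -> set:
        return set(self._entries)
```

Passing a fake `clock` makes the timeout rule easy to exercise without waiting 180 seconds.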
\ No newline at end of file
diff --git a/mkdocs.yml b/mkdocs.yml
index ec14dee15..43640a777 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -93,5 +93,4 @@ nav:
         - LifecycleManager: developers/lifecyclemanager.md
         - ShuffleClient: developers/shuffleclient.md
       - Fault Tolerant: developers/faulttolerant.md
-#      - ReadData: developers/readdata.md
-#      - Slots Allocation: developers/slotsallocation.md
+      - Worker Exclusion: developers/workerexclusion.md
