This is an automated email from the ASF dual-hosted git repository.
zhouky pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/incubator-celeborn.git
The following commit(s) were added to refs/heads/main by this push:
new b36ea3900 [CELEBORN-834][DOC] Add fault tolerant document
b36ea3900 is described below
commit b36ea39001d727d428fd1799bdc26e91ebe67a6f
Author: zky.zhoukeyong <[email protected]>
AuthorDate: Fri Jul 28 10:39:08 2023 +0800
[CELEBORN-834][DOC] Add fault tolerant document
### What changes were proposed in this pull request?
As title.
### Why are the changes needed?
As title.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual test.
Closes #1769 from waitinfuture/834.
Authored-by: zky.zhoukeyong <[email protected]>
Signed-off-by: zky.zhoukeyong <[email protected]>
---
docs/assets/img/batchrevive.svg | 4 ++
docs/assets/img/revive.svg | 4 ++
docs/developers/faulttolerant.md | 93 ++++++++++++++++++++++++++++++++++++++++
docs/developers/storage.md | 9 +++-
mkdocs.yml | 1 +
5 files changed, 110 insertions(+), 1 deletion(-)
diff --git a/docs/assets/img/batchrevive.svg b/docs/assets/img/batchrevive.svg
new file mode 100644
index 000000000..e4c2d9af6
--- /dev/null
+++ b/docs/assets/img/batchrevive.svg
@@ -0,0 +1,4 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!-- Do not edit this file with editors other than draw.io -->
+<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"
"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
+<svg xmlns="http://www.w3.org/2000/svg"
xmlns:xlink="http://www.w3.org/1999/xlink" version="1.1" width="577px"
height="91px" viewBox="-0.5 -0.5 577 91" content="<mxfile
host="app.diagrams.net" modified="2023-07-27T07:25:14.841Z"
agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36
(KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
etag="BQb3A-uZjqwxHBKLMWdC" version="21.6.5"
type="device"> <d [...]
\ No newline at end of file
diff --git a/docs/assets/img/revive.svg b/docs/assets/img/revive.svg
new file mode 100644
index 000000000..581b2722b
--- /dev/null
+++ b/docs/assets/img/revive.svg
@@ -0,0 +1,4 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!-- Do not edit this file with editors other than draw.io -->
+<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"
"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
+<svg xmlns="http://www.w3.org/2000/svg"
xmlns:xlink="http://www.w3.org/1999/xlink" version="1.1" width="405px"
height="173px" viewBox="-0.5 -0.5 405 173" content="<mxfile
host="app.diagrams.net" modified="2023-07-27T07:12:16.950Z"
agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36
(KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
etag="rP4WtZZUbo8I_TjyoYCL" version="21.6.5"
type="device"> < [...]
\ No newline at end of file
diff --git a/docs/developers/faulttolerant.md b/docs/developers/faulttolerant.md
new file mode 100644
index 000000000..a12ab6d92
--- /dev/null
+++ b/docs/developers/faulttolerant.md
@@ -0,0 +1,93 @@
+---
+license: |
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+
+ https://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+---
+
+# Fault Tolerant
+This article describes the detailed design of Celeborn's fault tolerance.
+
+In addition to replicating data to handle `Worker` loss, Celeborn tries to handle exceptions during shuffle
+as much as possible, especially the following:
+
+- When `PushData`/`PushMergedData` fails
+- When fetching a chunk fails
+- When a disk is unhealthy or reaching its space limit
+
+This article is based on
[ReducePartition](../../developers/storage#reducepartition).
+
+## Handle PushData Failure
+The detailed description of push data can be found in [PushData](../../developers/pushdata). Pushing data can fail for
+various reasons, e.g. high CPU load, network fluctuation, JVM GC, or `Worker` loss.
+
+Celeborn does not eagerly consider a `Worker` lost when pushing data fails. Instead, it treats the `Worker` as temporarily
+unavailable and asks for another (pair of) `PartitionLocation`(s) on different `Worker`(s) to continue pushing.
+The process is called `Revive`:
+
+
+
+Handling [PushMergedData](../../developers/pushdata#push-or-merge) failure is similar but more complex. Currently,
+`PushMergedData` is all-or-nothing, meaning either all data batches in the request succeed or all fail.
+Partial success is not supported yet.
+
+Upon `PushMergedData` failure, `ShuffleClient` first unpacks the request and revives every data batch. Note that, before the failure,
+all data batches in `PushMergedData` had the same primary and replica (if any) destination, whereas after reviving, the new
+`PartitionLocation`s can be spread across multiple `Worker`s.
+
+`ShuffleClient` then groups the new `PartitionLocation`s in the same way as before, resulting in multiple
+`PushMergedData` requests, and sends them to their destinations.
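
The regrouping step can be sketched as follows. This is only an illustration, not Celeborn's code: the `Location` class, the `regroup` function, and the worker names are hypothetical, and grouping by the `(primary, replica)` pair is an assumption based on the statement that batches in one `PushMergedData` share the same destination.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Location:
    """Hypothetical stand-in for a PartitionLocation."""
    partition_id: int
    epoch: int
    primary: str
    replica: Optional[str] = None


def regroup(batches, revived):
    """Group (partition_id, payload) batches by the destination of their revived location.

    Each resulting group would become one new PushMergedData request.
    """
    groups = defaultdict(list)
    for partition_id, payload in batches:
        loc = revived[partition_id]
        groups[(loc.primary, loc.replica)].append((loc, payload))
    return dict(groups)


# Three batches from one failed request; after reviving, partitions 0 and 2
# land on the same pair of workers while partition 1 lands elsewhere.
failed = [(0, b"a"), (1, b"b"), (2, b"c")]
revived = {
    0: Location(0, 1, "worker-2", "worker-3"),
    1: Location(1, 1, "worker-4", "worker-5"),
    2: Location(2, 1, "worker-2", "worker-3"),
}
requests = regroup(failed, revived)
```

In this example the single failed request turns into two new requests, one per destination pair.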
+
+Celeborn detects data loss when processing `CommitFiles` (see [Worker](../../developers/overview#shuffle-lifecycle)).
+Celeborn reports no `DataLost` if and only if every `PartitionLocation` has successfully committed at least one replica
+(if replication is turned off, there is only one replica for each `PartitionLocation`).
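
The condition above amounts to a simple predicate. The sketch below assumes a hypothetical `commit_results` mapping from a location identifier to the per-replica commit outcomes; neither the name nor the shape comes from Celeborn itself.

```python
def has_data_lost(commit_results):
    """Return True if some location has no successfully committed replica.

    commit_results: dict mapping a location id to a list of booleans,
    one per replica (a single entry when replication is off).
    """
    return not all(any(replicas) for replicas in commit_results.values())


# Every location has at least one committed replica: no data lost.
ok = has_data_lost({"0-0": [True, False], "1-0": [False, True]})
# Location "1-0" failed on both primary and replica: data lost.
lost = has_data_lost({"0-0": [True, True], "1-0": [False, False]})
# Replication off: one replica per location.
single = has_data_lost({"0-0": [True]})
```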
+
+When a `Worker` is down, all `PartitionLocation`s on the `Worker` will be revived, causing a flood of `Revive` RPCs
+to `LifecycleManager`. To alleviate this, `ShuffleClient` batches all `Revive` requests before sending them to
+`LifecycleManager`:
+
+
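
A minimal sketch of request batching, under stated assumptions: `ReviveBatcher`, `request`, and `flush` are hypothetical names, `send_batch` stands in for the RPC to `LifecycleManager`, and when `flush` gets called (for example, on a timer) is left open because the text does not specify it.

```python
import threading


class ReviveBatcher:
    """Collects revive requests and sends them together in one call."""

    def __init__(self, send_batch):
        self._send_batch = send_batch  # hypothetical callable standing in for the RPC
        self._pending = []
        self._lock = threading.Lock()

    def request(self, partition_id, epoch):
        """Queue one revive request instead of sending it immediately."""
        with self._lock:
            self._pending.append((partition_id, epoch))

    def flush(self):
        """Send all queued requests as a single batch; return the batch size."""
        with self._lock:
            batch, self._pending = self._pending, []
        if batch:
            self._send_batch(batch)
        return len(batch)


sent = []
batcher = ReviveBatcher(sent.append)
for pid in range(5):
    batcher.request(pid, 0)
first = batcher.flush()   # one call carrying five requests
second = batcher.flush()  # nothing pending, nothing sent
```

Five revive requests result in a single call to `send_batch` rather than five.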
+
+## Handle Fetch Failure
+As [ReducePartition](../../developers/storage#reducepartition) describes, a data file consists of chunks, and `ShuffleClient`
+asks for one chunk at a time.
+
+`ShuffleClient` defines the max number of retries for each replica (defaults to 3). When fetching a chunk fails,
+`ShuffleClient` will try another replica (if replication is off, it retries the same one).
+
+If the max number of retries is exceeded, `ShuffleClient` gives up retrying and throws an exception.
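
The retry policy might look roughly like the sketch below. It is not the actual client code: `fetch_chunk`, `FetchFailed`, and the `fetch` callable are hypothetical, and treating the total budget as the per-replica limit multiplied by the number of replicas, with replicas tried in alternation, is an assumption.

```python
class FetchFailed(Exception):
    """Raised when every allowed attempt has failed."""


def fetch_chunk(replicas, fetch, max_retries_per_replica=3):
    """Try replicas in turn until one returns the chunk or the budget runs out.

    With a single replica (replication off) the same one is retried.
    """
    total_attempts = max_retries_per_replica * len(replicas)
    last_error = None
    for attempt in range(total_attempts):
        replica = replicas[attempt % len(replicas)]
        try:
            return fetch(replica)
        except IOError as e:
            last_error = e
    raise FetchFailed(f"gave up after {total_attempts} attempts") from last_error


calls = []


def flaky_fetch(replica):
    # Hypothetical transport: the primary is unreachable, the replica works.
    calls.append(replica)
    if replica == "primary":
        raise IOError("primary unreachable")
    return b"chunk"


chunk = fetch_chunk(["primary", "replica"], flaky_fetch)
```

Here the first attempt against the primary fails and the second attempt, against the replica, returns the chunk.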
+
+## Disk Check
+`Worker` periodically checks disk health and usage. When a health check fails, `Worker` isolates the disk and will
+not allocate slots on it until it becomes healthy again.
+
+Similarly, if usable space falls below the threshold (defaults to 5GiB), `Worker` will not allocate slots on the disk. In
+addition, to avoid running out of space, `Worker` will trigger `HARD_SPLIT` for all `PartitionLocation`s on the disk to
+stop files from growing.
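
The two checks can be summarized as a small decision function. The names `DiskStatus` and `check_disk` are hypothetical; only the 5GiB default and the two outcomes (no slot allocation, `HARD_SPLIT` on low space) come from the text above.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class DiskStatus:
    """Hypothetical snapshot produced by a periodic disk check."""
    healthy: bool
    usable_bytes: int


def check_disk(status, min_usable_bytes=5 * GIB):
    """Return (can_allocate_slots, trigger_hard_split) for one disk."""
    if not status.healthy:
        # Unhealthy disks are isolated until they recover.
        return False, False
    low_space = status.usable_bytes < min_usable_bytes
    # Low space: stop allocating slots and split existing partitions.
    return not low_space, low_space


normal = check_disk(DiskStatus(healthy=True, usable_bytes=100 * GIB))
nearly_full = check_disk(DiskStatus(healthy=True, usable_bytes=1 * GIB))
broken = check_disk(DiskStatus(healthy=False, usable_bytes=100 * GIB))
```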
+
+## Exactly Once
+It can happen that `Worker` successfully receives and writes a data batch but fails to send ACK to `ShuffleClient`, or
+the primary successfully receives and writes a data batch but the replica fails. Also, different task attempts
+(e.g. speculative execution) will push the same data twice.
+
+In short, the same data batch can be duplicated across `PartitionLocation` splits. To guarantee
+exactly once, Celeborn ensures that no data is lost and no duplicate is read:
+
+- For each data batch, `ShuffleClient` adds a `(Map Id, Attempt Id, Batch Id)`
header, in which
+ `Batch Id` is a unique id for the data batch in the map attempt
+- `LifecycleManager` keeps all `PartitionLocation`s with the same partition id
+- For each `PartitionLocation` split, at least one replica is successfully
committed before shuffle read
+- `LifecycleManager` records the successful task attempt for each map id, and
only data from that attempt is read
+ for the map id
+- `ShuffleClient` discards data batches with a batch id that it has already
read
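
The read-side filtering in the last two bullets can be sketched as below. This is an illustration under assumptions: `read_deduplicated` is a hypothetical name, batches are modeled as plain tuples carrying the `(Map Id, Attempt Id, Batch Id)` header plus a payload, and `successful_attempt` stands in for the record kept by `LifecycleManager`.

```python
def read_deduplicated(batches, successful_attempt):
    """Yield each payload at most once, only from the successful attempt.

    batches: iterable of (map_id, attempt_id, batch_id, payload)
    successful_attempt: dict mapping map_id to its successful attempt_id
    """
    seen = set()
    for map_id, attempt_id, batch_id, payload in batches:
        if successful_attempt.get(map_id) != attempt_id:
            continue  # data from a failed or speculative attempt
        key = (map_id, attempt_id, batch_id)
        if key in seen:
            continue  # same batch already read from another split
        seen.add(key)
        yield payload


batches = [
    (0, 0, 0, "a-old"),  # attempt 0 of map 0 did not win
    (0, 1, 0, "a"),
    (0, 1, 0, "a"),      # duplicate caused by a re-push after a lost ACK
    (0, 1, 1, "b"),
    (1, 0, 0, "c"),
]
result = list(read_deduplicated(batches, {0: 1, 1: 0}))
```

Data from the losing attempt and the re-pushed duplicate are both dropped, leaving one copy of each batch.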
diff --git a/docs/developers/storage.md b/docs/developers/storage.md
index 2cdd699e4..e5915ae64 100644
--- a/docs/developers/storage.md
+++ b/docs/developers/storage.md
@@ -6,7 +6,9 @@ license: |
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
- http://www.apache.org/licenses/LICENSE-2.0
+
+ https://www.apache.org/licenses/LICENSE-2.0
+
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
@@ -42,6 +44,11 @@ The layout of `ReducePartition` is as follows:
which points to start positions of each chunk. Upon requesting data from some
partition, `Worker` first returns the
index, then sequentially reads and returns a chunk upon each
`ChunkFetchRequest`, which is very efficient.
+Notice that chunk boundaries are simply decided by the current chunk's size. With replication, since the
+order of data batch arrival is not guaranteed to be the same for primary and replica, chunks with the same chunk
+index will probably contain different data in primary and replica. Nevertheless, the whole files in primary and
+replica contain the same data batches in normal cases.
+
#### MapPartition
The layout of `MapPartition` is as follows:
diff --git a/mkdocs.yml b/mkdocs.yml
index c3cf6629c..c5ccf3159 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -90,5 +90,6 @@ nav:
- Traffic Control: developers/trafficcontrol.md
# - Client: developers/client.md
- PushData: developers/pushdata.md
+ - Fault Tolerant: developers/faulttolerant.md
# - ReadData: developers/readdata.md
# - Slots Allocation: developers/slotsallocation.md