This is an automated email from the ASF dual-hosted git repository.
zhouky pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/incubator-celeborn.git
The following commit(s) were added to refs/heads/main by this push:
new 27521547f [CELEBORN-823][DOC] Add Celeborn architecture document
27521547f is described below
commit 27521547f09b330d7a25848ef3a3a08162ed4639
Author: zky.zhoukeyong <[email protected]>
AuthorDate: Sat Jul 22 23:57:22 2023 +0800
[CELEBORN-823][DOC] Add Celeborn architecture document
### What changes were proposed in this pull request?
As title.
### Why are the changes needed?
As title.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes #1746 from waitinfuture/823.
Authored-by: zky.zhoukeyong <[email protected]>
Signed-off-by: zky.zhoukeyong <[email protected]>
---
docs/assets/img/celeborn.svg | 4 ++
docs/assets/img/ess.svg | 4 ++
docs/developers/client.md | 0
docs/developers/master.md | 0
docs/developers/overview.md | 108 +++++++++++++++++++++++++++++++++++++
docs/developers/readdata.md | 0
docs/developers/slotsallocation.md | 0
docs/developers/worker.md | 0
mkdocs.yml | 9 +++-
9 files changed, 124 insertions(+), 1 deletion(-)
diff --git a/docs/assets/img/celeborn.svg b/docs/assets/img/celeborn.svg
new file mode 100644
index 000000000..1450767a0
--- /dev/null
+++ b/docs/assets/img/celeborn.svg
@@ -0,0 +1,4 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!-- Do not edit this file with editors other than diagrams.net -->
+<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"
"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
+<svg xmlns="http://www.w3.org/2000/svg"
xmlns:xlink="http://www.w3.org/1999/xlink" version="1.1" width="564px"
height="351px" viewBox="-0.5 -0.5 564 351" content="<mxfile
host="app.diagrams.net" modified="2023-07-22T15:32:11.343Z"
agent="5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML,
like Gecko) Chrome/114.0.0.0 Safari/537.36"
etag="Z4OxtXCeFjUOn9pnpB-F" version="16.0.0"
type="device"><diagram id=&qu [...]
\ No newline at end of file
diff --git a/docs/assets/img/ess.svg b/docs/assets/img/ess.svg
new file mode 100644
index 000000000..1bf0d4f42
--- /dev/null
+++ b/docs/assets/img/ess.svg
@@ -0,0 +1,4 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!-- Do not edit this file with editors other than diagrams.net -->
+<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"
"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
+<svg xmlns="http://www.w3.org/2000/svg"
xmlns:xlink="http://www.w3.org/1999/xlink" version="1.1" width="331px"
height="161px" viewBox="-0.5 -0.5 331 161" content="<mxfile
host="app.diagrams.net" modified="2023-07-22T15:25:15.267Z"
agent="5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML,
like Gecko) Chrome/114.0.0.0 Safari/537.36"
etag="MqgsP8_XmYsonO9UTsSG" version="16.0.0"
type="github"><diagram id=&qu [...]
\ No newline at end of file
diff --git a/docs/developers/client.md b/docs/developers/client.md
new file mode 100644
index 000000000..e69de29bb
diff --git a/docs/developers/master.md b/docs/developers/master.md
new file mode 100644
index 000000000..e69de29bb
diff --git a/docs/developers/overview.md b/docs/developers/overview.md
new file mode 100644
index 000000000..7c49f449f
--- /dev/null
+++ b/docs/developers/overview.md
@@ -0,0 +1,108 @@
+---
+license: |
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+ http://www.apache.org/licenses/LICENSE-2.0
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+---
+
+# Celeborn Architecture
+
+This article introduces the high-level architecture of Apache Celeborn (Incubating). For a more detailed description of each module or process,
+please refer to the dedicated articles.
+
+## Why Celeborn
+In distributed compute engines, data exchange between compute nodes is common but expensive. The cost comes from
+disk and network inefficiency (M * N between Mappers and Reducers) in the traditional shuffle framework, as shown below:
+
+
+
+Besides being inefficient, the traditional shuffle framework requires large local storage on each compute node to hold shuffle
+data, which blocks the adoption of a disaggregated architecture.
+
+Apache Celeborn (Incubating) addresses these problems by reorganizing shuffle data in a more efficient way and storing the data in
+a separate service. The high-level architecture of Celeborn is as follows:
+
+
+
+## Components
+Celeborn (Incubating) has three primary components: Master, Worker, and Client.
+
+- Master manages the Celeborn cluster and achieves high availability (HA) based on Raft.
+- Worker processes read and write requests.
+- Client writes data to and reads data from the Celeborn cluster, and manages shuffle metadata for the application.
+
+In most distributed compute engines, there are typically two roles: one for application lifecycle management
+and task orchestration, for example `Driver` in Spark and `JobMaster` in Flink; the other for executing tasks, for example
+`Executor` in Spark and `TaskManager` in Flink.
+
+Similarly, the Celeborn Client is divided into two roles: `LifecycleManager` for the control plane, responsible for
+managing all shuffle metadata for the application; and `ShuffleClient` for the data plane, responsible for writing
+data to and reading data from Workers.
+
+`LifecycleManager` resides in `Driver` or `JobMaster`, one instance in each
application; `ShuffleClient` resides in
+each `Executor` or `TaskManager`, one instance in each process of
`Executor`/`TaskManager`.
+
+## Shuffle Lifecycle
+A typical lifecycle of a shuffle with Celeborn is as follows:
+
+1. Client sends `RegisterShuffle` to Master. Master allocates slots among
Workers and responds to Client.
+2. Client sends `ReserveSlots` to Workers. Workers reserve slots for the shuffle and respond to Client.
+3. Clients push data to allocated Workers. Data of the same `partitionId` are
pushed to the same logical `PartitionLocation`.
+4. After all Clients finish pushing data, Client sends `CommitFiles` to each Worker. Workers commit data
+ for the shuffle, then respond to Client.
+5. Clients send `OpenStream` to Workers for each partition split file to
prepare for reading.
+6. Clients send `ChunkFetchRequest` to Workers to read chunks.
+7. After Client finishes reading data, Client sends `UnregisterShuffle` to
Master to release resources.
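
The message sequence above can be sketched with a toy, in-memory model. This is only an illustration under stated assumptions, not Celeborn's implementation: the class and method names are hypothetical and merely echo the message names in the list, slots are allocated round-robin for simplicity, and `fetch` stands in for both `OpenStream` and `ChunkFetchRequest`.

```python
from collections import defaultdict


class Worker:
    """Toy stand-in for a Celeborn Worker; holds pushed data in memory."""

    def __init__(self, name):
        self.name = name
        self.reserved = set()             # (shuffle_id, partition_id) slots
        self.buffers = defaultdict(list)  # slot -> records pushed so far
        self.committed = {}               # slot -> immutable tuple of records

    def reserve_slots(self, shuffle_id, partition_ids):
        self.reserved.update((shuffle_id, p) for p in partition_ids)
        return True

    def push_data(self, shuffle_id, partition_id, record):
        slot = (shuffle_id, partition_id)
        if slot not in self.reserved:
            raise KeyError(f"slot {slot} was not reserved on {self.name}")
        self.buffers[slot].append(record)

    def commit_files(self, shuffle_id):
        for slot in [s for s in self.reserved if s[0] == shuffle_id]:
            self.committed[slot] = tuple(self.buffers.pop(slot, []))
        return True

    def fetch(self, shuffle_id, partition_id):
        # Stands in for OpenStream followed by ChunkFetchRequest.
        return list(self.committed[(shuffle_id, partition_id)])


class Master:
    """Toy stand-in for the Master; round-robin allocation is an assumption."""

    def __init__(self, workers):
        self.workers = workers
        self.shuffles = {}

    def register_shuffle(self, shuffle_id, num_partitions):
        allocation = {
            p: self.workers[p % len(self.workers)] for p in range(num_partitions)
        }
        self.shuffles[shuffle_id] = allocation
        return allocation

    def unregister_shuffle(self, shuffle_id):
        self.shuffles.pop(shuffle_id, None)


def run_shuffle(records, num_partitions, master, shuffle_id=0):
    """Walk through steps 1-7 for (key, value) records; returns data per partition."""
    allocation = master.register_shuffle(shuffle_id, num_partitions)  # step 1
    by_worker = defaultdict(list)
    for pid, worker in allocation.items():
        by_worker[worker].append(pid)
    for worker, pids in by_worker.items():                            # step 2
        worker.reserve_slots(shuffle_id, pids)
    for key, value in records:                                        # step 3
        pid = hash(key) % num_partitions
        allocation[pid].push_data(shuffle_id, pid, (key, value))
    for worker in by_worker:                                          # step 4
        worker.commit_files(shuffle_id)
    result = {
        pid: worker.fetch(shuffle_id, pid)                            # steps 5-6
        for pid, worker in allocation.items()
    }
    master.unregister_shuffle(shuffle_id)                             # step 7
    return result
```

Calling `run_shuffle` with integer keys, two `Worker` objects and three partitions returns a dictionary keyed by `partitionId`, where each entry holds every record that was pushed for that partition.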
+
+## Data Reorganization
+Celeborn improves disk and network efficiency through data reorganization.
Typically, Celeborn stores all shuffle data
+with the same `partitionId` in a logical `PartitionLocation`.
+
+In normal cases, each `PartitionLocation` corresponds to a single file. When a reducer needs the partition's data,
+it requires only one network connection and reads the coarse-grained file sequentially.
+
+In abnormal cases, such as when the file grows too large or a data push fails, Celeborn spawns a new split of the
+`PartitionLocation`, and future data within the partition is pushed to the new split.
+
+Client keeps the split information and tells the reducer to read from all splits of the `PartitionLocation` to guarantee that
+no data is lost.
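
The split behaviour described above can be illustrated with a minimal sketch. The `PartitionLocation` class here is a toy model, not Celeborn's class of the same name; `max_split_bytes` is a hypothetical threshold invented for this example, and only the "file grows too large" trigger is modelled. The helper `reads_needed` compares the M * N reads of a traditional shuffle with one read per reducer in the normal, single-file case.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionLocation:
    """Toy model: one logical location made of one or more split files."""

    partition_id: int
    max_split_bytes: int  # hypothetical threshold, not a Celeborn setting
    splits: list = field(default_factory=lambda: [bytearray()])

    def push(self, data: bytes) -> None:
        current = self.splits[-1]
        # Start a new split once the current file would grow too large.
        if current and len(current) + len(data) > self.max_split_bytes:
            current = bytearray()
            self.splits.append(current)
        current.extend(data)

    def read_all(self) -> bytes:
        # A reducer reads every split, in order, so that no data is lost.
        return b"".join(bytes(s) for s in self.splits)


def reads_needed(num_mappers: int, num_reducers: int) -> tuple:
    """Return (traditional M * N reads, reads with one file per partition)."""
    return num_mappers * num_reducers, num_reducers
```

With 1000 mappers and 500 reducers, `reads_needed(1000, 500)` gives 500000 small reads for the traditional layout against 500 sequential reads when every partition is a single file; each extra split adds one more read for that partition.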
+
+## Data Storage
+Celeborn stores shuffle data in multiple configurable layers, namely `Memory`, `Local Disks`, `Distributed File System`,
+and `Object Store`. Users can specify any combination of the layers on each Worker.
+
+Currently, Celeborn only supports `Local Disks` and `HDFS`. Support for other storage systems is in progress.
+
+## Compute Engine Integration
+Celeborn's primary components (i.e. Master, Worker, Client) are engine-independent. The Client APIs are extensible,
+which makes it easy to implement plugins for various engines.
+
+Currently, Celeborn officially supports
[Spark](https://spark.apache.org/)(both Spark 2.x and Spark 3.x),
+[Flink](https://flink.apache.org/)(1.14/1.15/1.17), and
+[Gluten](https://github.com/oap-project/gluten). Developers are also integrating Celeborn with other engines,
+for example [MR3](https://mr3docs.datamonad.com/docs/mr3/).
+
+The Celeborn community is also working on integrating Celeborn with other engines.
+
+## Graceful Shutdown
+To avoid impacting running applications when a Celeborn cluster is upgraded, Celeborn implements graceful shutdown.
+
+When graceful shutdown is turned on, Celeborn does the following when a Worker shuts down:
+
+1. Master will not allocate slots on the Worker
+2. Worker will inform Clients to split
+3. Client will send `CommitFiles` to the Worker
+
+The Worker then waits until every `PartitionLocation` has flushed its data to persistent storage, stores its state in a local LevelDB,
+and stops itself. The process typically completes within one minute.
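
The ordering of these steps can be sketched as follows. This is a toy model under stated assumptions: `GracefulWorker`, its fields and the log labels are hypothetical names made up for illustration, the three coordination steps are represented only as log entries, and a plain list stands in for the state that the real Worker stores in LevelDB.

```python
class GracefulWorker:
    """Toy Worker that records the graceful-shutdown steps in order."""

    def __init__(self, partition_buffers):
        # location name -> bytes not yet flushed to persistent storage
        self.partition_buffers = dict(partition_buffers)
        self.accepting_slots = True
        self.persisted = {}
        self.saved_state = None
        self.log = []

    def graceful_shutdown(self):
        # 1. Master stops allocating slots on this Worker.
        self.accepting_slots = False
        self.log.append("stop-allocation")
        # 2. Worker informs Clients to split, so new data goes elsewhere.
        self.log.append("notify-clients-to-split")
        # 3. Clients send CommitFiles to this Worker.
        self.log.append("commit-files")
        # Flush every PartitionLocation to persistent storage.
        for location, data in self.partition_buffers.items():
            self.persisted[location] = data
        self.partition_buffers.clear()
        self.log.append("flushed")
        # Save state (a sorted list here, standing in for LevelDB), then stop.
        self.saved_state = sorted(self.persisted)
        self.log.append("state-saved")
        self.log.append("stopped")
        return self.log
```

The point of the sketch is the ordering: the Worker stops receiving new slots and new data before it flushes, so nothing arrives after the flush and the saved state is complete when the process exits.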
+
+For more details, please refer to [Rolling upgrade](/upgrade).
\ No newline at end of file
diff --git a/docs/developers/readdata.md b/docs/developers/readdata.md
new file mode 100644
index 000000000..e69de29bb
diff --git a/docs/developers/slotsallocation.md
b/docs/developers/slotsallocation.md
new file mode 100644
index 000000000..e69de29bb
diff --git a/docs/developers/worker.md b/docs/developers/worker.md
new file mode 100644
index 000000000..e69de29bb
diff --git a/mkdocs.yml b/mkdocs.yml
index ce1342527..8cfd67370 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -80,4 +80,11 @@ nav:
- Ratis Shell: celeborn_ratis_shell.md
- Monitoring: monitoring.md
- Migration Guide: migration.md
-
+ - Developers Doc:
+ - Overview: developers/overview.md
+ - Master: developers/master.md
+ - Worker: developers/worker.md
+ - Client: developers/client.md
+ - PushData: developers/pushdata.md
+ - ReadData: developers/readdata.md
+ - Slots Allocation: developers/slotsallocation.md