This is an automated email from the ASF dual-hosted git repository.

zhouky pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/incubator-celeborn.git


The following commit(s) were added to refs/heads/main by this push:
     new 27521547f [CELEBORN-823][DOC] Add Celeborn architecture document
27521547f is described below

commit 27521547f09b330d7a25848ef3a3a08162ed4639
Author: zky.zhoukeyong <[email protected]>
AuthorDate: Sat Jul 22 23:57:22 2023 +0800

    [CELEBORN-823][DOC] Add Celeborn architecture document
    
    ### What changes were proposed in this pull request?
    As title.
    
    ### Why are the changes needed?
    As title.
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    No.
    
    Closes #1746 from waitinfuture/823.
    
    Authored-by: zky.zhoukeyong <[email protected]>
    Signed-off-by: zky.zhoukeyong <[email protected]>
---
 docs/assets/img/celeborn.svg       |   4 ++
 docs/assets/img/ess.svg            |   4 ++
 docs/developers/client.md          |   0
 docs/developers/master.md          |   0
 docs/developers/overview.md        | 108 +++++++++++++++++++++++++++++++++++++
 docs/developers/readdata.md        |   0
 docs/developers/slotsallocation.md |   0
 docs/developers/worker.md          |   0
 mkdocs.yml                         |   9 +++-
 9 files changed, 124 insertions(+), 1 deletion(-)

diff --git a/docs/assets/img/celeborn.svg b/docs/assets/img/celeborn.svg
new file mode 100644
index 000000000..1450767a0
--- /dev/null
+++ b/docs/assets/img/celeborn.svg
@@ -0,0 +1,4 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!-- Do not edit this file with editors other than diagrams.net -->
+<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
+<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" version="1.1" width="564px"
height="351px" viewBox="-0.5 -0.5 564 351" content="&lt;mxfile 
host=&quot;app.diagrams.net&quot; modified=&quot;2023-07-22T15:32:11.343Z&quot; 
agent=&quot;5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, 
like Gecko) Chrome/114.0.0.0 Safari/537.36&quot; 
etag=&quot;Z4OxtXCeFjUOn9pnpB-F&quot; version=&quot;16.0.0&quot; 
type=&quot;device&quot;&gt;&lt;diagram id=&qu [...]
\ No newline at end of file
diff --git a/docs/assets/img/ess.svg b/docs/assets/img/ess.svg
new file mode 100644
index 000000000..1bf0d4f42
--- /dev/null
+++ b/docs/assets/img/ess.svg
@@ -0,0 +1,4 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!-- Do not edit this file with editors other than diagrams.net -->
+<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
+<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" version="1.1" width="331px"
height="161px" viewBox="-0.5 -0.5 331 161" content="&lt;mxfile 
host=&quot;app.diagrams.net&quot; modified=&quot;2023-07-22T15:25:15.267Z&quot; 
agent=&quot;5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, 
like Gecko) Chrome/114.0.0.0 Safari/537.36&quot; 
etag=&quot;MqgsP8_XmYsonO9UTsSG&quot; version=&quot;16.0.0&quot; 
type=&quot;github&quot;&gt;&lt;diagram id=&qu [...]
\ No newline at end of file
diff --git a/docs/developers/client.md b/docs/developers/client.md
new file mode 100644
index 000000000..e69de29bb
diff --git a/docs/developers/master.md b/docs/developers/master.md
new file mode 100644
index 000000000..e69de29bb
diff --git a/docs/developers/overview.md b/docs/developers/overview.md
new file mode 100644
index 000000000..7c49f449f
--- /dev/null
+++ b/docs/developers/overview.md
@@ -0,0 +1,108 @@
+---
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+  http://www.apache.org/licenses/LICENSE-2.0
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+# Celeborn Architecture
+
+This article introduces the high-level architecture of Apache Celeborn (Incubating). For a more detailed description of each module or process,
+please refer to the dedicated articles.
+
+## Why Celeborn
+In distributed compute engines, data exchange between compute nodes is common but expensive. The cost comes from
+the disk and network inefficiency (M * N between Mappers and Reducers) in the traditional shuffle framework, as shown below:
+
+![ESS](/assets/img/ess.svg)
+
+Besides being inefficient, the traditional shuffle framework requires large local storage on each compute node to store shuffle
+data, which blocks the adoption of a disaggregated architecture.
+
+Apache Celeborn (Incubating) solves these problems by reorganizing shuffle data in a more efficient way and storing the data in
+a separate service. The high-level architecture of Celeborn is as follows:
+
+![Celeborn](/assets/img/celeborn.svg)
+
+## Components
+Celeborn(Incubating) has three primary components: Master, Worker, and Client.
+
+- Master manages the Celeborn cluster and achieves high availability (HA) based on Raft.
+- Worker processes read and write requests.
+- Client writes data to and reads data from the Celeborn cluster, and manages shuffle metadata for the application.
+
+In most distributed compute engines, there are typically two roles: one role for application lifecycle management
+and task orchestration, e.g. `Driver` in Spark and `JobMaster` in Flink; the other role for executing tasks, e.g.
+`Executor` in Spark and `TaskManager` in Flink.
+
+Similarly, the Celeborn Client is divided into two roles: `LifecycleManager` for the control plane, responsible for
+managing all shuffle metadata for the application; and `ShuffleClient` for the data plane, responsible for writing data to
+and reading data from Workers.
+
+`LifecycleManager` resides in `Driver` or `JobMaster`, one instance in each 
application; `ShuffleClient` resides in
+each `Executor` or `TaskManager`, one instance in each process of 
`Executor`/`TaskManager`.
+
+## Shuffle Lifecycle
+A typical lifecycle of a shuffle with Celeborn is as follows:
+
+1. Client sends `RegisterShuffle` to Master. Master allocates slots among 
Workers and responds to Client.
+2. Client sends `ReserveSlots` to Workers. Workers reserve slots for the shuffle and respond to Client.
+3. Clients push data to allocated Workers. Data of the same `partitionId` are 
pushed to the same logical `PartitionLocation`.
+4. After all Clients finish pushing data, Client sends `CommitFiles` to each Worker. Workers commit data
+   for the shuffle, then respond to Client.
+5. Clients send `OpenStream` to Workers for each partition split file to 
prepare for reading.
+6. Clients send `ChunkFetchRequest` to Workers to read chunks.
+7. After Client finishes reading data, Client sends `UnregisterShuffle` to 
Master to release resources.
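
The seven steps above can be sketched as a small in-memory simulation. This is only an illustration of the message order, not Celeborn's implementation: the `Worker` and `Master` classes, their method names, the round-robin slot allocation, and the `key % num_partitions` partitioner are all assumptions made for the example.

```python
from collections import defaultdict


class Worker:
    """Hypothetical in-memory stand-in for a Celeborn Worker."""

    def __init__(self, name):
        self.name = name
        self.reserved = set()             # (shuffle_id, partition_id) slots
        self.buffers = defaultdict(list)  # pushed but not yet committed
        self.committed = {}               # committed data, readable by reducers

    def reserve_slots(self, shuffle_id, partition_ids):
        self.reserved.update((shuffle_id, p) for p in partition_ids)

    def push_data(self, shuffle_id, partition_id, record):
        if (shuffle_id, partition_id) not in self.reserved:
            raise KeyError("slot was not reserved")
        self.buffers[(shuffle_id, partition_id)].append(record)

    def commit_files(self, shuffle_id):
        for key in [k for k in self.buffers if k[0] == shuffle_id]:
            self.committed[key] = self.buffers.pop(key)

    def fetch(self, shuffle_id, partition_id):
        # Stands in for OpenStream followed by ChunkFetchRequest.
        return list(self.committed.get((shuffle_id, partition_id), []))


class Master:
    """Hypothetical stand-in for the Master; allocation is round-robin."""

    def __init__(self, workers):
        self.workers = workers
        self.shuffles = {}

    def register_shuffle(self, shuffle_id, num_partitions):
        allocation = {p: self.workers[p % len(self.workers)]
                      for p in range(num_partitions)}
        self.shuffles[shuffle_id] = allocation
        return allocation

    def unregister_shuffle(self, shuffle_id):
        self.shuffles.pop(shuffle_id, None)


def run_shuffle(records, num_partitions=2):
    workers = [Worker("w0"), Worker("w1")]
    master = Master(workers)
    shuffle_id = 0

    allocation = master.register_shuffle(shuffle_id, num_partitions)  # step 1
    for worker in workers:                                            # step 2
        parts = [p for p, w in allocation.items() if w is worker]
        worker.reserve_slots(shuffle_id, parts)
    for key, value in records:                                        # step 3
        p = key % num_partitions
        allocation[p].push_data(shuffle_id, p, (key, value))
    for worker in workers:                                            # step 4
        worker.commit_files(shuffle_id)
    result = {p: allocation[p].fetch(shuffle_id, p)                   # steps 5-6
              for p in range(num_partitions)}
    master.unregister_shuffle(shuffle_id)                             # step 7
    return result
```

Calling `run_shuffle([(0, "a"), (1, "b"), (2, "c")])` groups the records by partition, so all records of one partition are read back from a single Worker.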
+
+## Data Reorganization
+Celeborn improves disk and network efficiency through data reorganization. 
Typically, Celeborn stores all shuffle data
+with the same `partitionId` in a logical `PartitionLocation`.
+
+In normal cases, each `PartitionLocation` corresponds to a single file. When a reducer requests the partition's data,
+it needs only one network connection and reads the coarse-grained file sequentially.
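
As a rough back-of-the-envelope comparison, the function below counts reads in the two schemes. It assumes one partition per reducer and no splits, which is a simplification for illustration; the function name `shuffle_reads` is hypothetical.

```python
def shuffle_reads(num_mappers, num_reducers):
    """Count reads under the simplifying assumptions stated above."""
    # Traditional shuffle: every reducer reads from every mapper's output.
    traditional = num_mappers * num_reducers
    # Reorganized shuffle: every reducer reads one file for its partition.
    reorganized = num_reducers
    return traditional, reorganized
```

For 1000 mappers and 500 reducers this gives 500000 reads versus 500.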
+
+In abnormal cases, such as when the file grows too large or pushing data fails, Celeborn spawns a new split of the
+`PartitionLocation`, and subsequent data within the partition is pushed to the new split.
+
+The Client keeps the split information and tells the reducer to read from all splits of the `PartitionLocation` to guarantee
+no data is lost.
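
A minimal sketch of the split behaviour, assuming a size threshold is the only trigger (the text also mentions push failures, which are not modelled here). The class below is a hypothetical in-memory model; `max_split_bytes` and the method names are invented for the example and are not Celeborn configuration or API names.

```python
class PartitionLocation:
    """Hypothetical model of one logical partition made of ordered splits."""

    def __init__(self, partition_id, max_split_bytes):
        self.partition_id = partition_id
        self.max_split_bytes = max_split_bytes
        self.splits = [[]]       # each split is a list of byte chunks
        self._current_size = 0   # bytes written to the latest split

    def push(self, chunk):
        # Start a new split when the current one would grow too large.
        too_large = self._current_size + len(chunk) > self.max_split_bytes
        if self._current_size > 0 and too_large:
            self.splits.append([])
            self._current_size = 0
        self.splits[-1].append(chunk)
        self._current_size += len(chunk)

    def read_all(self):
        # The reducer reads every split, in order, so no data is lost.
        return b"".join(chunk for split in self.splits for chunk in split)
```

With `max_split_bytes=4`, pushing `b"ab"`, `b"cd"`, and `b"ef"` fills the first split with the first two chunks and places the third in a new split, while `read_all()` still returns the full partition data.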
+
+## Data Storage
+Celeborn stores shuffle data in multiple configurable layers, i.e. `Memory`, `Local Disks`, `Distributed File System`,
+and `Object Store`. Users can specify any combination of the layers on each Worker.
+
+Currently, Celeborn only supports `Local Disks` and `HDFS`. Support for other storage systems is in progress.
+
+## Compute Engine Integration
+Celeborn's primary components (i.e. Master, Worker, Client) are engine-agnostic. The Client APIs are extensible,
+which makes it easy to implement plugins for various engines.
+
+Currently, Celeborn officially supports [Spark](https://spark.apache.org/) (both Spark 2.x and Spark 3.x),
+[Flink](https://flink.apache.org/) (1.14/1.15/1.17), and
+[Gluten](https://github.com/oap-project/gluten). Developers are also integrating Celeborn with other engines,
+for example [MR3](https://mr3docs.datamonad.com/docs/mr3/).
+
+Celeborn community is also working on integrating Celeborn with other engines.
+
+## Graceful Shutdown
+To avoid impacting running applications when upgrading a Celeborn cluster, Celeborn implements graceful shutdown.
+
+When graceful shutdown is turned on, Celeborn does the following when a Worker shuts down:
+
+1. Master will not allocate slots on the Worker
+2. Worker will inform Clients to split
+3. Client will send `CommitFiles` to the Worker
+
+Then the Worker waits until every `PartitionLocation` has flushed its data to persistent storage, stores its state in a local LevelDB,
+then stops itself. The process typically completes within one minute.
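
The ordering described above can be sketched as follows. This is a hedged illustration of the sequence only: the function, its parameters, and the step labels are hypothetical, and a plain dictionary stands in for the local LevelDB store.

```python
def graceful_shutdown(worker, master_state, open_locations, state_store):
    """Return the ordered steps of a graceful Worker shutdown (illustrative)."""
    steps = []

    # 1. Master stops allocating new slots on this Worker.
    master_state["excluded"].add(worker)
    steps.append("exclude_from_allocation")

    # 2. Worker informs Clients to split, so new data goes to other Workers.
    steps.append("inform_clients_to_split")

    # 3. Clients send CommitFiles to the Worker.
    steps.append("commit_files")

    # Worker waits until every PartitionLocation is flushed to storage.
    flushed = sorted(open_locations)
    open_locations.clear()
    steps.append("flush_all_locations")

    # Worker persists its state (a dict here, LevelDB in the text), then stops.
    state_store[worker] = {"flushed": flushed}
    steps.append("persist_state")
    steps.append("stop")
    return steps
```

The point of the ordering is that no new data lands on the Worker after step 1, so the final flush covers everything the Worker ever accepted.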
+
+For more details, please refer to [Rolling upgrade](/upgrade)
\ No newline at end of file
diff --git a/docs/developers/readdata.md b/docs/developers/readdata.md
new file mode 100644
index 000000000..e69de29bb
diff --git a/docs/developers/slotsallocation.md b/docs/developers/slotsallocation.md
new file mode 100644
index 000000000..e69de29bb
diff --git a/docs/developers/worker.md b/docs/developers/worker.md
new file mode 100644
index 000000000..e69de29bb
diff --git a/mkdocs.yml b/mkdocs.yml
index ce1342527..8cfd67370 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -80,4 +80,11 @@ nav:
       - Ratis Shell: celeborn_ratis_shell.md
   - Monitoring: monitoring.md
   - Migration Guide: migration.md
-
+  - Developers Doc:
+      - Overview: developers/overview.md
+      - Master: developers/master.md
+      - Worker: developers/worker.md
+      - Client: developers/client.md
+      - PushData: developers/pushdata.md
+      - ReadData: developers/readdata.md
+      - Slots Allocation: developers/slotsallocation.md
