This is an automated email from the ASF dual-hosted git repository.

zhouky pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/incubator-celeborn.git


The following commit(s) were added to refs/heads/main by this push:
     new 070d8bc0f [CELEBORN-826][DOC] Add storage document
070d8bc0f is described below

commit 070d8bc0f8572cfa4c8b16e6a47e705eb9c756f4
Author: zky.zhoukeyong <[email protected]>
AuthorDate: Mon Jul 24 16:12:42 2023 +0800

    [CELEBORN-826][DOC] Add storage document
    
    ### What changes were proposed in this pull request?
    As title.
    
    ### Why are the changes needed?
    As title.
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    No.
    
    Closes #1752 from waitinfuture/826.
    
    Authored-by: zky.zhoukeyong <[email protected]>
    Signed-off-by: zky.zhoukeyong <[email protected]>
---
 docs/assets/img/mappartition.svg    |   4 ++
 docs/assets/img/multilayer.svg      |   4 ++
 docs/assets/img/reducepartition.svg |   4 ++
 docs/developers/pushdata.md         |   1 +
 docs/developers/storage.md          | 128 ++++++++++++++++++++++++++++++++++++
 docs/developers/worker.md           |  31 +++++++++
 mkdocs.yml                          |  12 ++--
 7 files changed, 179 insertions(+), 5 deletions(-)

diff --git a/docs/assets/img/mappartition.svg b/docs/assets/img/mappartition.svg
new file mode 100644
index 000000000..e70da9c49
--- /dev/null
+++ b/docs/assets/img/mappartition.svg
@@ -0,0 +1,4 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!-- Do not edit this file with editors other than draw.io -->
+<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
+<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" version="1.1" width="488px" 
height="178px" viewBox="-0.5 -0.5 488 178" content="&lt;mxfile 
host=&quot;app.diagrams.net&quot; modified=&quot;2023-07-24T06:59:46.339Z&quot; 
agent=&quot;Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 
(KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36&quot; 
etag=&quot;IDSWAViEHxnfNJGXZTIJ&quot; version=&quot;21.6.5&quot; 
type=&quot;device&quot;&gt;&#10;  &lt [...]
\ No newline at end of file
diff --git a/docs/assets/img/multilayer.svg b/docs/assets/img/multilayer.svg
new file mode 100644
index 000000000..af2c08671
--- /dev/null
+++ b/docs/assets/img/multilayer.svg
@@ -0,0 +1,4 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!-- Do not edit this file with editors other than diagrams.net -->
+<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
+<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" version="1.1" width="988px" 
height="231px" viewBox="-0.5 -0.5 988 231" content="&lt;mxfile 
host=&quot;app.diagrams.net&quot; modified=&quot;2023-07-24T04:08:38.756Z&quot; 
agent=&quot;5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, 
like Gecko) Chrome/114.0.0.0 Safari/537.36&quot; 
etag=&quot;9ykfdIQmzQp8y21vunwz&quot; version=&quot;16.0.0&quot; 
type=&quot;device&quot;&gt;&lt;diagram id=&qu [...]
\ No newline at end of file
diff --git a/docs/assets/img/reducepartition.svg 
b/docs/assets/img/reducepartition.svg
new file mode 100644
index 000000000..c9dcf13c4
--- /dev/null
+++ b/docs/assets/img/reducepartition.svg
@@ -0,0 +1,4 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!-- Do not edit this file with editors other than draw.io -->
+<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
+<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" version="1.1" width="388px" 
height="105px" viewBox="-0.5 -0.5 388 105" content="&lt;mxfile 
host=&quot;app.diagrams.net&quot; modified=&quot;2023-07-24T06:37:26.285Z&quot; 
agent=&quot;Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 
(KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36&quot; 
etag=&quot;o-vHMtNLAAEOaQ6fgHxS&quot; version=&quot;21.6.5&quot; 
type=&quot;device&quot;&gt;&#10;  &lt [...]
\ No newline at end of file
diff --git a/docs/developers/pushdata.md b/docs/developers/pushdata.md
index 07bdae1c3..694dbe361 100644
--- a/docs/developers/pushdata.md
+++ b/docs/developers/pushdata.md
@@ -100,6 +100,7 @@ For the first three cases, `Worker` informs `ShuffleClient` 
that it should trigg
 `ShuffleClient` triggers split itself.
 
 There are two kinds of Split:
+
 - `HARD_SPLIT`, meaning old `PartitionLocation` epoch refuses to accept any 
data, and future data of the
   `PartitionLocation` will only be pushed after new `PartitionLocation` epoch 
is ready
 - `SOFT_SPLIT`, meaning old `PartitionLocation` epoch continues to accept 
data, when new epoch is ready, `ShuffleClient`
diff --git a/docs/developers/storage.md b/docs/developers/storage.md
new file mode 100644
index 000000000..61adc34f2
--- /dev/null
+++ b/docs/developers/storage.md
@@ -0,0 +1,128 @@
+---
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+  http://www.apache.org/licenses/LICENSE-2.0
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+# Storage
+This article describes the detailed design of Celeborn `Worker`'s storage 
management.
+
+## `PartitionLocation` Physical Storage
+Logically, a `PartitionLocation` contains all data with the same partition id. Physically, Celeborn stores
+the data of a partition in multiple files, where each file corresponds to one `PartitionLocation` object with a unique epoch
+for the partition. All `PartitionLocation`s with the same partition id but different epochs together make up the complete
+data for the partition. A file can reside in memory, on local disks, or in DFS/OSS; see `Multi-layered Storage` below.
+
+A `PartitionLocation` file can be read only after it is committed, which is triggered by the `CommitFiles` RPC.
+
+## File Layout
+Celeborn supports two kinds of partitions:
+
+- `ReducePartition`, where each `PartitionLocation` file stores a portion of 
data with the same partition id,
+  currently used for Apache Spark.
+- `MapPartition`, where each `PartitionLocation` file stores a portion of data 
from the same map id, currently
+  used for Apache Flink.
+
+#### ReducePartition
+The layout of `ReducePartition` is as follows:
+
+![ReducePartition](../../assets/img/reducepartition.svg)
+
+A `ReducePartition` data file consists of several chunks (8 MiB each by default). Each data file has an in-memory index
+that points to the start position of each chunk. Upon a request for data from some partition, `Worker` first returns the
+index, then sequentially reads and returns one chunk for each `ChunkFetchRequest`, which is efficient because the file is read sequentially.
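
The chunk index described above can be illustrated with a minimal Python sketch. This is not Celeborn's implementation: `ChunkIndex`, `on_flush`, `finish`, and `chunk_range` are hypothetical names, and the sketch assumes a chunk is closed as soon as the bytes written since its start reach the chunk size.

```python
CHUNK_SIZE = 8 * 1024 * 1024  # 8 MiB, the default mentioned in the text


class ChunkIndex:
    """Hypothetical in-memory index holding the start offset of every chunk."""

    def __init__(self, chunk_size=CHUNK_SIZE):
        self.chunk_size = chunk_size
        self.offsets = [0]       # offsets[i] is the file position where chunk i starts
        self.bytes_written = 0

    def on_flush(self, num_bytes):
        """Record a flush; close the current chunk once it reaches chunk_size."""
        self.bytes_written += num_bytes
        if self.bytes_written - self.offsets[-1] >= self.chunk_size:
            self.offsets.append(self.bytes_written)

    def finish(self):
        """Seal the index (e.g. at commit time) and return the chunk boundaries."""
        if self.offsets[-1] != self.bytes_written:
            self.offsets.append(self.bytes_written)
        return list(self.offsets)


def chunk_range(offsets, chunk_id):
    """Return (start, length) of one chunk, enough for one sequential read."""
    return offsets[chunk_id], offsets[chunk_id + 1] - offsets[chunk_id]
```

With a sealed index, serving chunk `i` needs only one seek to `offsets[i]` and one read of `offsets[i + 1] - offsets[i]` bytes.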
+
+#### MapPartition
+The layout of `MapPartition` is as follows:
+
+![MapPartition](../../assets/img/mappartition.svg)
+
+A `MapPartition` data file consists of several regions (64 MiB each by default), and each region is sorted by partition id.
+Each region has an in-memory index that points to the start position of each partition. Upon a request for data from
+some partition, `Worker` reads that partition's data from every region.
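
As a rough illustration of the region layout, the following Python sketch sorts the records of one region by partition id and keeps a per-region index. The function names and the `(offset, length)` index shape are assumptions made for this example, not Celeborn's on-disk format.

```python
def build_region(records):
    """Build one region from (partition_id, payload) pairs.

    Returns (region_bytes, index) where index maps partition_id -> (start, length).
    """
    data = bytearray()
    index = {}
    # sorted() is stable, so payloads of the same partition keep their arrival order
    for pid, payload in sorted(records, key=lambda r: r[0]):
        start, length = index.get(pid, (len(data), 0))
        index[pid] = (start, length + len(payload))
        data += payload
    return bytes(data), index


def read_partition(regions, pid):
    """Collect the data of one partition from every region that contains it."""
    parts = []
    for data, index in regions:
        if pid in index:
            start, length = index[pid]
            parts.append(data[start:start + length])
    return parts
```

Because each region is sorted, the data of one partition inside a region is contiguous, so a reader needs one range read per region.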
+
+For more details about reading data, please refer to 
[ReadData](../../developers/readdata).
+
+## Local Disk and Memory Buffer
+At the time of writing, the most common setup is local disk only. Users specify the directories and
+capacity that Celeborn can use to store data. It is recommended to specify one directory per disk. If users
+specify several directories on one disk, Celeborn tries to detect this and manages them at disk-level
+granularity.
+
+`Worker` periodically checks disk health, isolates disks that are unhealthy or out of space, and reports to `Master`
+through heartbeat.
+
+Upon receiving `ReserveSlots`, `Worker` will first try to create a 
`FileWriter` on the hinted disk. If that disk is
+unavailable, `Worker` will choose a healthy one.
+
+Upon receiving `PushData` or `PushMergedData`, `Worker` unpacks the data (for 
`PushMergedData`) and logically appends
+to the buffered data for each `PartitionLocation` (no physical memory copy). 
If the buffer exceeds the threshold
+(defaults to 256KiB), data will be flushed to the file asynchronously.
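
The buffering behaviour can be sketched in Python as follows. `PartitionBuffer` and its `sink` callback are hypothetical; the sketch keeps references to the pushed buffers instead of copying them, and for simplicity it calls the sink synchronously, whereas the text above says the real flush is asynchronous.

```python
FLUSH_THRESHOLD = 256 * 1024  # 256 KiB, the default mentioned in the text


class PartitionBuffer:
    """Hypothetical per-PartitionLocation buffer with a flush threshold."""

    def __init__(self, sink, threshold=FLUSH_THRESHOLD):
        self.sink = sink            # callable that receives a list of buffers to write
        self.threshold = threshold
        self.pending = []           # references to pushed buffers, no byte copy
        self.pending_bytes = 0

    def append(self, buf):
        """Logically append a pushed buffer; flush once the threshold is exceeded."""
        self.pending.append(buf)
        self.pending_bytes += len(buf)
        if self.pending_bytes > self.threshold:
            self.flush()

    def flush(self):
        """Hand all pending buffers to the sink (also what a commit would call)."""
        if self.pending:
            batch, self.pending, self.pending_bytes = self.pending, [], 0
            self.sink(batch)
```

A commit would call `flush()` once more to drain whatever is still below the threshold.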
+
+If data replication is turned on, `Worker` sends the data to the replica asynchronously. Only after `Worker`
+receives an ACK from the replica does it return an ACK to `ShuffleClient`. Note that data does not have to be flushed
+to file before the ACK is sent.
+
+Upon receiving `CommitFiles`, `Worker` flushes all buffered data for the `PartitionLocation`s specified in
+the RPC and closes the files, then responds with the lists of succeeded and failed `PartitionLocation`s.
+
+## Trigger Split
+Upon receiving `PushData` (note: receiving `PushMergedData` currently does not trigger Split; that is future work),
+`Worker` checks whether disk usage exceeds the disk reservation (defaults to 5GiB). If so, `Worker` responds
+with Split to `ShuffleClient`.
+
+Celeborn supports two configurable kinds of split:
+
+- `HARD_SPLIT`, meaning old `PartitionLocation` epoch refuses to accept any 
data, and future data of the
+  `PartitionLocation` will only be pushed after new `PartitionLocation` epoch 
is ready
+- `SOFT_SPLIT`, meaning old `PartitionLocation` epoch continues to accept 
data, when new epoch is ready, `ShuffleClient`
+  switches to the new location transparently
+
+The detailed design of split can be found [here](../../developers/pushdata#split).
+
+## Self Check
+In addition to the health and space checks on each disk, `Worker` also collects performance statistics and reports them to `Master` for
+better [slots allocation](../../developers/slotsallocation):
+
+- Average flush time of the last time window
+- Average fetch time of the last time window
+
+## Multi-layered Storage
+Celeborn aims to store data in multiple layers, i.e. memory, local disks, and distributed file systems (or object stores
+such as S3 and OSS). At the time of writing, Celeborn supports local disks and HDFS.
+
+The principles of data placement are:
+
+- Try to cache small data in memory
+- Always prefer faster storage
+- Trade off between faster storage's space and cost of data movement
+
+The high-level design of multi-layered storage is:
+
+![storage](../../assets/img/multilayer.svg)
+
+`Worker`'s memory is divided into two logical regions: `Push Region` and `Cache Region`. `ShuffleClient` pushes data
+into `Push Region`, as ① indicates. Whenever the buffered data in `Push Region` for a `PartitionLocation` exceeds the
+threshold (defaults to 256KiB), `Worker` flushes it to some storage layer. The policy of data movement is as follows:
+
+- If the `PartitionLocation` is not in `Cache Region` and `Cache Region` has enough space, logically move the data
+  to `Cache Region`, as ② indicates. Note that this only accounts the data to `Cache Region` and does not physically copy memory.
+- If the `PartitionLocation` is in `Cache Region`, logically append the current data, as ③ indicates.
+- If the `PartitionLocation` is not in `Cache Region` and `Cache Region` does not have enough memory,
+  flush the data to local disk, as ④ indicates.
+- If the `PartitionLocation` is not in `Cache Region` and neither `Cache Region` nor local disk has enough space,
+  flush the data to DFS/OSS, as ⑤ indicates.
+- If `Cache Region` exceeds the threshold, choose the largest `PartitionLocation` and flush it to local disk, as ⑥
+  indicates.
+- Optionally, if local disk does not have enough space, choose a `PartitionLocation` split and evict it to HDFS/OSS.
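
The data movement policy listed above can be read as a small decision function. The Python sketch below is only an interpretation of those rules: the function names, the boolean inputs, and the returned action strings are all hypothetical, and the optional disk-to-DFS eviction is left out.

```python
def choose_action(in_cache, cache_has_space, disk_has_space):
    """Decide where data flushed from Push Region goes (cases 2 to 5 in the list)."""
    if in_cache:
        return "append_in_cache"        # case 3: location already lives in Cache Region
    if cache_has_space:
        return "move_to_cache"          # case 2: logical move, no memory copy
    if disk_has_space:
        return "flush_to_local_disk"    # case 4
    return "flush_to_dfs"               # case 5: neither cache nor local disk has room


def pick_eviction(cache_sizes, cache_threshold):
    """Case 6: if Cache Region exceeds the threshold, pick the largest location.

    cache_sizes maps a location name to its cached byte count.
    Returns the location to flush to local disk, or None if nothing must be evicted.
    """
    if sum(cache_sizes.values()) <= cache_threshold:
        return None
    return max(cache_sizes, key=cache_sizes.get)
```

Checking `in_cache` first reflects that the second rule applies whenever the location is already cached, while the other rules are ordered from the fastest layer to the slowest.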
\ No newline at end of file
diff --git a/docs/developers/worker.md b/docs/developers/worker.md
index e69de29bb..05028069d 100644
--- a/docs/developers/worker.md
+++ b/docs/developers/worker.md
@@ -0,0 +1,31 @@
+---
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+  http://www.apache.org/licenses/LICENSE-2.0
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+# Worker
+The main functions of Celeborn `Worker` are:
+
+- Store, serve, and manage `PartitionLocation` data. See 
[Storage](../../developers/storage)
+- Traffic control through `Back Pressure` and `Congestion Control`
+- Support rolling upgrade through `Graceful Shutdown`
+- Support elasticity through `Decommission Shutdown`
+- Self health check
+
+Celeborn `Worker` has four dedicated servers:
+
+- `Controller` handles control messages, i.e. `ReserveSlots`, `CommitFiles`, 
and `DestroyWorkerSlots`
+- `Push Server` handles primary input data, i.e. `PushData` and 
`PushMergedData`, and push related control messages
+- `Replicate Server` handles replica input data; it shares the same logic as `Push Server`
+- `Fetch Server` handles fetch requests, i.e. `ChunkFetchRequest`, and fetch 
related control messages
\ No newline at end of file
diff --git a/mkdocs.yml b/mkdocs.yml
index b39e13cea..f684d708a 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -83,9 +83,11 @@ nav:
   - Migration Guide: migration.md
   - Developers Doc:
       - Overview: developers/overview.md
-      - Master: developers/master.md
-      - Worker: developers/worker.md
-      - Client: developers/client.md
+#      - Master: developers/master.md
+      - Worker:
+        - Overview: developers/worker.md
+        - Storage: developers/storage.md
+#      - Client: developers/client.md
       - PushData: developers/pushdata.md
-      - ReadData: developers/readdata.md
-      - Slots Allocation: developers/slotsallocation.md
+#      - ReadData: developers/readdata.md
+#      - Slots Allocation: developers/slotsallocation.md
