This is an automated email from the ASF dual-hosted git repository.
zhouky pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/incubator-celeborn.git
The following commit(s) were added to refs/heads/main by this push:
new 070d8bc0f [CELEBORN-826][DOC] Add storage document
070d8bc0f is described below
commit 070d8bc0f8572cfa4c8b16e6a47e705eb9c756f4
Author: zky.zhoukeyong <[email protected]>
AuthorDate: Mon Jul 24 16:12:42 2023 +0800
[CELEBORN-826][DOC] Add storage document
### What changes were proposed in this pull request?
As title.
### Why are the changes needed?
As title.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes #1752 from waitinfuture/826.
Authored-by: zky.zhoukeyong <[email protected]>
Signed-off-by: zky.zhoukeyong <[email protected]>
---
docs/assets/img/mappartition.svg | 4 ++
docs/assets/img/multilayer.svg | 4 ++
docs/assets/img/reducepartition.svg | 4 ++
docs/developers/pushdata.md | 1 +
docs/developers/storage.md | 128 ++++++++++++++++++++++++++++++++++++
docs/developers/worker.md | 31 +++++++++
mkdocs.yml | 12 ++--
7 files changed, 179 insertions(+), 5 deletions(-)
diff --git a/docs/assets/img/mappartition.svg b/docs/assets/img/mappartition.svg
new file mode 100644
index 000000000..e70da9c49
--- /dev/null
+++ b/docs/assets/img/mappartition.svg
@@ -0,0 +1,4 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!-- Do not edit this file with editors other than draw.io -->
+<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"
"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
+<svg xmlns="http://www.w3.org/2000/svg"
xmlns:xlink="http://www.w3.org/1999/xlink" version="1.1" width="488px"
height="178px" viewBox="-0.5 -0.5 488 178" content="<mxfile
host="app.diagrams.net" modified="2023-07-24T06:59:46.339Z"
agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36
(KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
etag="IDSWAViEHxnfNJGXZTIJ" version="21.6.5"
type="device"> < [...]
\ No newline at end of file
diff --git a/docs/assets/img/multilayer.svg b/docs/assets/img/multilayer.svg
new file mode 100644
index 000000000..af2c08671
--- /dev/null
+++ b/docs/assets/img/multilayer.svg
@@ -0,0 +1,4 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!-- Do not edit this file with editors other than diagrams.net -->
+<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"
"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
+<svg xmlns="http://www.w3.org/2000/svg"
xmlns:xlink="http://www.w3.org/1999/xlink" version="1.1" width="988px"
height="231px" viewBox="-0.5 -0.5 988 231" content="<mxfile
host="app.diagrams.net" modified="2023-07-24T04:08:38.756Z"
agent="5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML,
like Gecko) Chrome/114.0.0.0 Safari/537.36"
etag="9ykfdIQmzQp8y21vunwz" version="16.0.0"
type="device"><diagram id=&qu [...]
\ No newline at end of file
diff --git a/docs/assets/img/reducepartition.svg
b/docs/assets/img/reducepartition.svg
new file mode 100644
index 000000000..c9dcf13c4
--- /dev/null
+++ b/docs/assets/img/reducepartition.svg
@@ -0,0 +1,4 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!-- Do not edit this file with editors other than draw.io -->
+<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"
"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
+<svg xmlns="http://www.w3.org/2000/svg"
xmlns:xlink="http://www.w3.org/1999/xlink" version="1.1" width="388px"
height="105px" viewBox="-0.5 -0.5 388 105" content="<mxfile
host="app.diagrams.net" modified="2023-07-24T06:37:26.285Z"
agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36
(KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
etag="o-vHMtNLAAEOaQ6fgHxS" version="21.6.5"
type="device"> < [...]
\ No newline at end of file
diff --git a/docs/developers/pushdata.md b/docs/developers/pushdata.md
index 07bdae1c3..694dbe361 100644
--- a/docs/developers/pushdata.md
+++ b/docs/developers/pushdata.md
@@ -100,6 +100,7 @@ For the first three cases, `Worker` informs `ShuffleClient`
that it should trigg
`ShuffleClient` triggers split itself.
There are two kinds of Split:
+
- `HARD_SPLIT`, meaning old `PartitionLocation` epoch refuses to accept any
data, and future data of the
`PartitionLocation` will only be pushed after new `PartitionLocation` epoch
is ready
- `SOFT_SPLIT`, meaning old `PartitionLocation` epoch continues to accept
data, when new epoch is ready, `ShuffleClient`
diff --git a/docs/developers/storage.md b/docs/developers/storage.md
new file mode 100644
index 000000000..61adc34f2
--- /dev/null
+++ b/docs/developers/storage.md
@@ -0,0 +1,128 @@
+---
+license: |
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+ http://www.apache.org/licenses/LICENSE-2.0
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+---
+
+# Storage
+This article describes the detailed design of Celeborn `Worker`'s storage
management.
+
+## `PartitionLocation` Physical Storage
+Logically, `PartitionLocation` contains all data with the same partition id.
Physically, Celeborn stores
+`PartitionLocation` in multiple files, each file corresponds to one
`PartitionLocation` object with a unique epoch
+for the partition. All `PartitionLocation`s with the same partition id but
different epochs aggregate to the complete
+data for the partition. The file can be in memory, local disks, or DFS/OSS,
see `Multi-layered Storage` below.
+
+A `PartitionLocation` file can be read only after it is committed, trigger by
`CommitFiles` RPC.
+
+## File Layout
+Celeborn supports two kinds of partitions:
+
+- `ReducePartition`, where each `PartitionLocation` file stores a portion of
data with the same partition id,
+ currently used for Apache Spark.
+- `MapPartition`, where each `PartitionLocation` file stores a portion of data
from the same map id, currently
+ used for Apache Flink.
+
+#### ReducePartition
+The layout of `ReducePartition` is as follows:
+
+
+
+`ReducePartition` data file consists of several chunks (defaults to 8 MiB).
Each data file has an in-memory index
+which points to start positions of each chunk. Upon requesting data from some
partition, `Worker` first returns the
+index, then sequentially reads and returns a chunk upon each
`ChunkFetchRequest`, which is very efficient.
+
+#### MapPartition
+The layout of `MapPartition` is as follows:
+
+
+
+`MapPartition` data file consists of several regions (defaults to 64MiB), each
region is sorted by partition id.
+Each region has an in-memory index which points to start positions of each
partition. Upon requesting data from
+some partition, `Worker` reads the partition data from every region.
+
+For more details about reading data, please refer to
[ReadData](../../developers/readdata).
+
+## Local Disk and Memory Buffer
+To the time this article is written, the most common case is local disk only.
Users specify directories and
+capacity that Celeborn can use to store data. It is recommended to specify one
directory per disk. If users
+specify more directories on one disk, Celeborn will try to figure it out and
manage in the disk-level
+granularity.
+
+`Worker` periodically checks disk health, isolates unhealthy or spaceless
disks, and reports to `Master`
+through heartbeat.
+
+Upon receiving `ReserveSlots`, `Worker` will first try to create a
`FileWriter` on the hinted disk. If that disk is
+unavailable, `Worker` will choose a healthy one.
+
+Upon receiving `PushData` or `PushMergedData`, `Worker` unpacks the data (for
`PushMergedData`) and logically appends
+to the buffered data for each `PartitionLocation` (no physical memory copy).
If the buffer exceeds the threshold
+(defaults to 256KiB), data will be flushed to the file asynchronously.
+
+If data replication is turned on, `Worker` will send the data to replica
asynchronously. Only after `Worker`
+receives ACK from replica will it return ACK to `ShuffleClient`. Notice that
it's not required that data is flushed
+to file before sending ACK.
+
+Upon receiving `CommitFiles`, `Worker` will flush all buffered data for
`PartitionLocation`s specified in
+the RPC and close files, then responds the succeeded and failed
`PartitionLocation` lists.
+
+## Trigger Split
+Upon receiving `PushData` (note: currently receiving `PushMergedData` does not
trigger Split, it's future work),
+`Worker` will check whether disk usage exceeds disk reservation (defaults to
5GiB). If so, `Worker` will respond
+Split to `ShuffleClient`.
+
+Celeborn supports two configurable kinds of split:
+
+- `HARD_SPLIT`, meaning old `PartitionLocation` epoch refuses to accept any
data, and future data of the
+ `PartitionLocation` will only be pushed after new `PartitionLocation` epoch
is ready
+- `SOFT_SPLIT`, meaning old `PartitionLocation` epoch continues to accept
data, when new epoch is ready, `ShuffleClient`
+ switches to the new location transparently
+
+The detailed design of split can be found
[Here](../../developers/pushdata#split).
+
+## Self Check
+In additional to health and space check on each disk, `Worker` also collects
perf statistics to feed Master for
+better [slots allocation](../../developers/slotsallocation):
+
+- Average flush time of the last time window
+- Average fetch time of the last time window
+
+## Multi-layered Storage
+Celeborn aims to store data in multiple layers, i.e. memory, local disks and
distributed file systems(or object store
+like S3, OSS). To the time this article is written, Celeborn supports local
disks and HDFS.
+
+The principles of data placement are:
+
+- Try to cache small data in memory
+- Always prefer faster storage
+- Trade off between faster storage's space and cost of data movement
+
+The high-level design of multi-layered storage is:
+
+
+
+`Worker`'s memory is divided into two logical regions: `Push Region` and
`Cache Region`. `ShuffleClient` pushes data
+into `Push Region`, as ① indicates. Whenever the buffered data in `PushRegion`
for a `PartitionLocation` exceeds the
+threshold (defaults to 256KiB), `Worker` flushes it to some storage layer. The
policy of data movement is as follows:
+
+- If the `PartitionLocation` is not in `Cache Region` and `Cache Region` has
enough space, logically move the data
+ to `Cache Region`. Notice this just counts the data in `Cache Region` and
does not physically do memory copy. As ②
+ indicates.
+- If the `PartitionLocation` is in `Cache Region`, logically append the
current data, as ③ indicates.
+- If the `PartitionLocation` is not in `Cache Region` and `Cache Region` does
not have enough memory,
+ flush the data into local disk, as ④ indicates.
+- If the `PartitionLocation` is not in `Cache Region` and both `Cache Region`
and local disk do not have enough memory,
+ flush the data into DFS/OSS, as ⑤ indicates.
+- If the `Cache Region` exceeds the threshold, choose the largest
`PartitionLocation` and flush it to local disk, as ⑥
+ indicates.
+- Optionally, if local disk does not have enough memory, choose a
`PartitionLocation` split and evict to HDFS/OSS.
\ No newline at end of file
diff --git a/docs/developers/worker.md b/docs/developers/worker.md
index e69de29bb..05028069d 100644
--- a/docs/developers/worker.md
+++ b/docs/developers/worker.md
@@ -0,0 +1,31 @@
+---
+license: |
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+ http://www.apache.org/licenses/LICENSE-2.0
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+---
+
+# Worker
+The main functions of Celeborn `Worker` are:
+
+- Store, serve, and manage `PartitionLocation` data. See
[Storage](../../developers/storage)
+- Traffic control through `Back Pressure` and `Congestion Control`
+- Support rolling upgrade through `Graceful Shutdown`
+- Support elasticity through `Decommission Shutdown`
+- Self health check
+
+Celeborn `Worker` has four dedicated servers:
+
+- `Controller` handles control messages, i.e. `ReserveSlots`, `CommitFiles`,
and `DestroyWorkerSlots`
+- `Push Server` handles primary input data, i.e. `PushData` and
`PushMergedData`, and push related control messages
+- `Replicate Server` handles replica input data, it has the same logic with
`Push Server`
+- `Fetch Server` handles fetch requests, i.e. `ChunkFetchRequest`, and fetch
related control messages
\ No newline at end of file
diff --git a/mkdocs.yml b/mkdocs.yml
index b39e13cea..f684d708a 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -83,9 +83,11 @@ nav:
- Migration Guide: migration.md
- Developers Doc:
- Overview: developers/overview.md
- - Master: developers/master.md
- - Worker: developers/worker.md
- - Client: developers/client.md
+# - Master: developers/master.md
+ - Worker:
+ - Overview: developers/worker.md
+ - Storage: developers/storage.md
+# - Client: developers/client.md
- PushData: developers/pushdata.md
- - ReadData: developers/readdata.md
- - Slots Allocation: developers/slotsallocation.md
+# - ReadData: developers/readdata.md
+# - Slots Allocation: developers/slotsallocation.md