This is an automated email from the ASF dual-hosted git repository.

roryqi pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-uniffle-website.git


The following commit(s) were added to refs/heads/master by this push:
     new 9e9dcf2  Add new blog (#46)
9e9dcf2 is described below

commit 9e9dcf241b213ffce99d4d3830a47fd21191a84f
Author: roryqi <[email protected]>
AuthorDate: Sun Jul 23 11:04:12 2023 +0800

    Add new blog (#46)
    
    * Add new blog
    
    * fix
---
 blog/2023-07-21/current_state.md      | 165 ++++++++++++++++++++++++++++++++++
 blog/2023-07-21/img/aqe.png           | Bin 0 -> 148308 bytes
 blog/2023-07-21/img/architecture.png  | Bin 0 -> 147255 bytes
 blog/2023-07-21/img/data_read.png     | Bin 0 -> 191060 bytes
 blog/2023-07-21/img/get_results.png   | Bin 0 -> 67869 bytes
 blog/2023-07-21/img/gluten.png        | Bin 0 -> 293542 bytes
 blog/2023-07-21/img/hdfs_fallback.png | Bin 0 -> 51425 bytes
 blog/2023-07-21/img/io_random.png     | Bin 0 -> 50489 bytes
 blog/2023-07-21/img/io_sort.png       | Bin 0 -> 42263 bytes
 blog/2023-07-21/img/metadata.png      | Bin 0 -> 128684 bytes
 blog/2023-07-21/img/process.png       | Bin 0 -> 49584 bytes
 blog/2023-07-21/img/select.png        | Bin 0 -> 165995 bytes
 12 files changed, 165 insertions(+)

diff --git a/blog/2023-07-21/current_state.md b/blog/2023-07-21/current_state.md
new file mode 100644
index 0000000..5d3c61b
--- /dev/null
+++ b/blog/2023-07-21/current_state.md
@@ -0,0 +1,165 @@
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one or more
+  ~ contributor license agreements.  See the NOTICE file distributed with
+  ~ this work for additional information regarding copyright ownership.
+  ~ The ASF licenses this file to You under the Apache License, Version 2.0
+  ~ (the "License"); you may not use this file except in compliance with
+  ~ the License.  You may obtain a copy of the License at
+  ~
+  ~    http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing, software
+  ~ distributed under the License is distributed on an "AS IS" BASIS,
+  ~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  ~ See the License for the specific language governing permissions and
+  ~ limitations under the License.
+  -->
+  
+# Uniffle: A New Chapter for Shuffle in the Cloud-Native Era
+
+## Background
+Shuffle is the process in distributed computing frameworks used to 
redistribute data between upstream and downstream tasks. It is a crucial 
component within computing frameworks and directly impacts their performance 
and stability. 
+However, as cloud-native architectures have been adopted, traditional Shuffle solutions have revealed various issues.
+In a cloud-native architecture, techniques such as storage-compute separation and mixed deployment are often applied together.
+Computational nodes typically have small disk capacities, poor IO performance, and an imbalance between CPU and IO resources.
+Additionally, computational nodes may be preempted by high-priority jobs under mixed deployment.
+In traditional Shuffle implementations, Shuffle nodes are tightly coupled with 
computational nodes. However, due to the different resource requirements for 
disk, memory, CPU, and node stability between computational and Shuffle nodes, 
it is challenging to independently scale them based on their resource needs.
+By separating computational nodes from Shuffle nodes, a computational node offloads its Shuffle state to Shuffle nodes and becomes more lightweight, reducing the need to recompute jobs when computational nodes are preempted. Decoupling the two also lowers the disk requirements on computational nodes, enabling a larger pool of usable computational nodes.
+In cloud-native architectures, large Shuffle jobs can also put significant pressure on local disks, causing insufficient disk capacity on computational nodes and heavy random disk IO, which hurts the performance and stability of those jobs.
+The industry has explored various new Shuffle technologies, including Google's BigQuery shuffle, Baidu's DCE Shuffle, Facebook's Cosco Shuffle, Uber's Zeus Shuffle, Alibaba's Celeborn Shuffle, and many others.
+Each system has made its own trade-offs based on different scenarios. Uniffle 
aims to create a fast, accurate, stable, and cost-efficient cloud-native Remote 
Shuffle Service, considering performance, correctness, stability, and cost as 
its core aspects.
+
+## Architecture
+![Architecture](img/architecture.png)
+### Coordinator
+The Coordinator is responsible for managing the entire cluster, and the 
Shuffle Server reports the cluster's load situation to the Coordinator through 
heartbeats. Based on the cluster's load, the Coordinator assigns suitable 
Shuffle Servers for jobs. To facilitate operations and maintenance, the 
Coordinator supports configuration deployment and provides a RESTful API for 
external access.
+
+### Shuffle Server
+The Shuffle Server is primarily responsible for receiving Shuffle data, aggregating it, and writing it to storage. For Shuffle data stored on local disks, the Shuffle Server also serves the reads.
+
+### Client
+The Client is responsible for communicating with the Coordinator and Shuffle 
Server. It handles tasks such as requesting Shuffle Servers, sending 
heartbeats, and performing read and write operations on Shuffle data. It 
provides an SDK for Spark, MapReduce, and Tez to use.
+
+
+## Read & Write process
+![process](img/process.png)
+1. The Driver obtains allocation information from the Coordinator.
+2. The Driver registers Shuffle information with the Shuffle Server.
+3. Based on the allocation information, the Executor sends Shuffle data to the 
Shuffle Server in the form of Blocks.
+4. The Shuffle Server writes the data into storage.
+5. After the write task completes, the Executor reports the result to the Driver.
+6. The read task retrieves successful write task information from the Driver.
+7. The read task retrieves Shuffle metadata (such as all blockIds) from the 
Shuffle Server.
+8. Based on the storage model, the read task reads Shuffle data from the 
storage side.
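The eight steps above can be sketched as a minimal in-memory simulation. This is a sketch only: the class and method names (`ShuffleServer`, `send_block`, and so on) are illustrative and not Uniffle's actual API.

```python
class ShuffleServer:
    def __init__(self):
        self.blocks = {}  # (shuffle_id, partition) -> {block_id: bytes}

    def register(self, shuffle_id, partition):
        # step 2: the Driver registers Shuffle information
        self.blocks.setdefault((shuffle_id, partition), {})

    def send_block(self, shuffle_id, partition, block_id, data):
        # steps 3-4: an Executor sends a Block; the server writes it to "storage"
        self.blocks[(shuffle_id, partition)][block_id] = data

    def get_block_ids(self, shuffle_id, partition):
        # step 7: a read task fetches Shuffle metadata (all blockIds)
        return set(self.blocks[(shuffle_id, partition)])

    def read(self, shuffle_id, partition, block_id):
        # step 8: the read task fetches block data from the storage side
        return self.blocks[(shuffle_id, partition)][block_id]

server = ShuffleServer()
server.register(0, 0)
server.send_block(0, 0, block_id=1, data=b"a")
server.send_block(0, 0, block_id=2, data=b"b")
ids = server.get_block_ids(0, 0)
payload = b"".join(server.read(0, 0, i) for i in sorted(ids))
```

In the real system these calls are RPCs and the "storage" is memory, local disk, or HDFS; the sketch only shows the ordering of the steps.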
+
+## Performance
+
+### 1) Hybrid storage
+In our internal production environment, KB-level partition data blocks account for more than 80% of the total. To effectively address the random IO caused by these small partitions, Uniffle incorporates the concept of in-memory Shuffle, drawing on Google's Dremel. Additionally, considering that 80% of the data capacity in the production environment comes from large partitions, Uniffle introduces disk and HDFS as storage media to address the data capa [...]
+
+### 2) Random IO Optimization
+![random_io](img/io_random.png)
+Random IO essentially stems from a large number of small data-block operations. To avoid them, Uniffle first aggregates the same partition from multiple MapTasks in the Shuffle Server's memory to form larger partition data. When the Shuffle data in memory reaches the per-partition threshold or the overall threshold, it is written to local or remote storage.
+![io_sort](img/io_sort.png)
+When the overall memory threshold is reached, the partition data in memory is sorted by size, and the larger partitions are written to the storage media first. Additionally, once the data in memory drops back to a certain level, the writing of Shuffle data to the storage media stops, further reducing random disk IO.
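The aggregate-then-flush policy described above can be sketched as follows. The thresholds, class, and method names are illustrative, not Uniffle's actual defaults: data is buffered per partition, a full partition is flushed on its own, and when overall memory fills up the largest partitions are flushed first, stopping once usage drops below a low watermark.

```python
class ShuffleBuffer:
    def __init__(self, partition_threshold, overall_threshold, low_watermark):
        self.partitions = {}   # partition id -> list of buffered chunks
        self.total = 0         # total buffered bytes
        self.partition_threshold = partition_threshold
        self.overall_threshold = overall_threshold
        self.low_watermark = low_watermark
        self.flushed = []      # (partition id, flushed size), in flush order

    def size(self, pid):
        return sum(len(c) for c in self.partitions.get(pid, []))

    def add(self, pid, chunk):
        self.partitions.setdefault(pid, []).append(chunk)
        self.total += len(chunk)
        if self.size(pid) >= self.partition_threshold:
            # per-partition threshold reached: flush this partition alone
            self._flush(pid)
        elif self.total >= self.overall_threshold:
            # overall threshold reached: flush largest partitions first
            for p in sorted(self.partitions, key=self.size, reverse=True):
                if self.total <= self.low_watermark:
                    break      # stop flushing once memory has dropped enough
                self._flush(p)

    def _flush(self, pid):
        sz = self.size(pid)
        self.total -= sz
        self.flushed.append((pid, sz))
        self.partitions[pid] = []

buf = ShuffleBuffer(partition_threshold=200, overall_threshold=256,
                    low_watermark=128)
buf.add(0, b"x" * 100)
buf.add(1, b"y" * 90)
buf.add(2, b"z" * 80)   # total reaches 270: the two largest partitions flush
```

After the last `add`, partitions 0 and 1 (the largest) have been flushed and partition 2 stays in memory, since the total has dropped below the watermark.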
+
+
+### 3) Storage media selection strategy
+![select](img/select.png)
+For writing Shuffle data to local or remote storage, Uniffle has observed through testing that larger data blocks yield better write performance on remote storage: when the block size exceeds 64MB, write throughput to remote storage can reach 32MB/s. Therefore, when writing to storage media, Uniffle routes larger data blocks to remote storage and smaller data blocks to local storage, based on their size.
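The routing rule above reduces to a size cutoff. A minimal sketch, where the 64MB cutoff comes from the observation in the text and the function name is illustrative:

```python
REMOTE_THRESHOLD = 64 * 1024 * 1024  # bytes; cutoff observed in testing

def select_storage(block_size):
    # larger blocks achieve better write throughput on remote storage,
    # so they are routed remotely; smaller blocks stay on local disk
    return "remote" if block_size > REMOTE_THRESHOLD else "local"
```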
+
+### 4) Write concurrency
+For larger partitions, it is challenging to meet the performance requirements 
of writing to remote storage with a single thread. In HDFS, a file can only be 
written by one writer. To address this limitation, Uniffle allows multiple 
files to be mapped to a single partition for remote storage. Uniffle utilizes 
multi-threading to increase the write performance of large partitions. However, 
it's important to note that a single partition occupying all remote storage 
threads can affect the wri [...]
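The multi-file idea can be sketched as follows: one partition is split across several files, each with its own single writer thread, and a per-partition cap keeps one partition from monopolizing the writer pool. The names, cap, and file layout are illustrative, and a dict stands in for HDFS files.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_FILES_PER_PARTITION = 2   # cap so one partition cannot take every thread
written = {}                  # file name -> bytes written (stand-in for HDFS)

def write_file(name, chunks):
    # each file has exactly one writer, matching HDFS's single-writer rule
    written[name] = b"".join(chunks)
    return name

def flush_partition(partition_id, chunks, pool):
    # map the partition onto several files so writes proceed concurrently
    n = min(MAX_FILES_PER_PARTITION, len(chunks))
    files = [f"part-{partition_id}-{i}" for i in range(n)]
    futures = [pool.submit(write_file, files[i], chunks[i::n])
               for i in range(n)]
    return [f.result() for f in futures]

with ThreadPoolExecutor(max_workers=4) as pool:
    names = flush_partition(7, [b"a", b"b", b"c", b"d"], pool)
```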
+
+### 5) Data Distribution
+![aqe](img/aqe.png)
+For computational engines with Spark AQE (Adaptive Query Execution), there are scenarios where a single task reads only part of a partition's data, as well as scenarios where multiple partitions must be read. When reading only part of a partition, randomly distributed data can cause significant read amplification, while sorting and rewriting the data after it is written incurs considerable performance loss. Theref [...]
+![get_results](img/get_results.png)
+In scenarios where multiple partitions need to be read, Uniffle allocates the reads of multiple partitions to a single Shuffle Server. This allows RPC (Remote Procedure Call) requests to be aggregated, so that requests for multiple partitions can be merged and sent to one Shuffle Server, minimizing network overhead and improving overall performance. For more detailed information, please refer to [4].
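The aggregation idea can be sketched as grouping the partitions a task must read by the Shuffle Server that holds them, so one batched request per server replaces one request per partition. The assignment map and names are illustrative:

```python
from collections import defaultdict

def aggregate_requests(partition_to_server, partitions):
    # group partition reads by server: one batched RPC per server
    by_server = defaultdict(list)
    for p in partitions:
        by_server[partition_to_server[p]].append(p)
    return dict(by_server)

assignment = {0: "server-a", 1: "server-a", 2: "server-b"}
batched = aggregate_requests(assignment, [0, 1, 2])
# two requests go out instead of three: one to each server
```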
+
+### 6) Off-heap memory management
+Uniffle uses gRPC for data communication, and the gRPC implementation involves multiple memory-copy steps. Additionally, the Shuffle Server currently manages data in heap memory. When running a Shuffle Server with 80GB of memory in a production environment, it can experience heavy garbage collection (GC), with individual GC pauses lasting approximately 22 seconds. To address this issue, Uniffle upgraded the JDK to version 11. On the data tra [...]
+
+### 7) Columnar Shuffle Format
+The Uniffle framework itself does not natively support columnar Shuffle. To fill this gap, Uniffle integrates with Gluten, a columnar shuffle component, and reuses the columnar Shuffle capabilities that Gluten provides. For more detailed information, please refer to [5].
+![Gluten](img/gluten.png)
+### 8) Barrier-free execution
+For batch processing, distributed computing frameworks commonly use the Bulk Synchronous Parallel (BSP) model, in which downstream tasks start only after all upstream tasks have completed. To reduce the impact of straggler nodes on job performance, some frameworks support a "slow start" mechanism that lets upstream and downstream tasks run concurrently. Stream processing and OLAP engines, on the other hand, use a pipeline model wh [...]
+
+To cater to various computing frameworks, Uniffle employs a barrier-free design that allows upstream and downstream stages to run concurrently. The key to barrier-free execution is supporting efficient in-memory read/write operations together with an effective index-filtering mechanism. With this design, a job does not need to ask the Shuffle Server to flush all data to storage media at the end of each stage. Additionally, since upstream and downstream stage [...]
+
+Uniffle has designed both bitmap index filtering and file index filtering 
mechanisms to handle in-memory and storage media data respectively. This 
enables Uniffle to efficiently support barrier-free execution and improve 
performance by avoiding redundant data reads.
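One way to picture the index filtering: each blockId encodes the task that produced it, and a reader keeps only blocks from tasks the driver reported as successful. This is a sketch, with a plain set standing in for the bitmap and an illustrative bit layout that is not Uniffle's actual blockId format:

```python
TASK_BITS = 20  # illustrative: low bits of a blockId hold the task id

def make_block_id(seq, task_id):
    return (seq << TASK_BITS) | task_id

def task_of(block_id):
    return block_id & ((1 << TASK_BITS) - 1)

def filter_blocks(block_ids, successful_tasks):
    # drop blocks written by failed or speculative task attempts
    return {b for b in block_ids if task_of(b) in successful_tasks}

blocks = {make_block_id(0, 1), make_block_id(1, 1), make_block_id(0, 2)}
valid = filter_blocks(blocks, successful_tasks={1})
```

Only the two blocks produced by task 1 survive; the block from task 2 (not in the successful set) is filtered out before any data is read.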
+
+
+### Performance evaluation
+Benchmarks conducted with Uniffle 0.2 show that its shuffle performance is comparable to Spark's native shuffle for small data volumes, while for large data volumes Uniffle outperforms Spark's native shuffle by up to 30%. The benchmark results are available at: https://github.com/apache/incubator-uniffle/blob/master/docs/benchmark.md
+
+## Correctness
+
+### Metadata Verification
+![meta](img/metadata.png)
+Spark reports information about all completed tasks to the driver. In the first step of the reduce phase, the reducer retrieves the list of unique task identifiers from the driver. Blocks are the data units sent from mappers to the Shuffle Server, and each block has a unique identifier; a block's data may reside in memory, on local disk, or in HDFS. To ensure data integrity, Uniffle incorporates metadata verification and designs index files for the data files stored on local disks and in HDFS. T [...]
+
+### Data Verification
+![verify](img/data_read.png)
+To address data corruption issues, Uniffle performs CRC (Cyclic Redundancy 
Check) verification on data blocks. When reading data, Uniffle recalculates the 
CRC and compares it with the CRC stored in the file to determine if the data is 
corrupted. This helps prevent reading incorrect data.
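The check can be sketched with Python's `zlib.crc32` standing in for the block checksum; the dict layout and function names are illustrative. The CRC is stored when the block is written and recomputed on read:

```python
import zlib

def write_block(data):
    # store the checksum alongside the block, as Uniffle stores it in the file
    return {"data": data, "crc": zlib.crc32(data)}

def read_block(block):
    # recompute on read and compare; fail rather than return corrupt data
    if zlib.crc32(block["data"]) != block["crc"]:
        raise IOError("corrupted shuffle block")
    return block["data"]

good = write_block(b"shuffle-data")
corrupt = {"data": b"shuffle-dat4", "crc": good["crc"]}  # one flipped byte
```

Reading `good` returns the payload, while reading `corrupt` raises instead of silently handing back bad data.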
+
+
+## Stability
+### 1) Fallback for Hybrid storage
+![hdfs](img/hdfs_fallback.png)
+HDFS clusters may experience fluctuations in stability, causing writes to HDFS to fail during certain periods. To minimize the impact of such fluctuations, Uniffle has designed a fallback mechanism: when a write to HDFS fails, the data is stored locally instead, reducing the impact on the job.
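The mechanism amounts to try-remote-then-local. A minimal sketch, where the writer functions are illustrative stand-ins for the HDFS and local-disk writers:

```python
def write_with_fallback(data, write_hdfs, write_local):
    try:
        return ("HDFS", write_hdfs(data))
    except IOError:
        # HDFS is fluctuating: keep the job alive by storing locally
        return ("LOCAL", write_local(data))

def flaky_hdfs(data):
    raise IOError("HDFS write timed out")

storage, _ = write_with_fallback(b"block", flaky_hdfs, lambda d: len(d))
healthy, _ = write_with_fallback(b"block", lambda d: len(d), lambda d: len(d))
```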
+
+### 2) Flow Control
+Before sending a request, the job client will first request the memory 
resources corresponding to the data. If there is insufficient memory, the job 
will wait and stop sending data, thereby implementing flow control for the job.
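The pre-allocation step can be sketched as a memory budget on the server side: a send is allowed only if the reservation succeeds, otherwise the client backs off until memory is released. Class and method names are illustrative:

```python
class MemoryBudget:
    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0

    def require(self, size):
        # pre-allocation request made before any data is sent
        if self.used + size > self.capacity:
            return False       # insufficient memory: client waits and retries
        self.used += size
        return True

    def release(self, size):
        self.used -= size      # freed once the data is flushed to storage

budget = MemoryBudget(capacity=100)
assert budget.require(60)      # first sender reserves memory and proceeds
assert not budget.require(60)  # over capacity: second sender backs off
budget.release(60)
assert budget.require(60)      # succeeds after memory is released
```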
+
+### 3) Replication
+Uniffle adopts the Quorum replica protocol, allowing jobs to configure the 
number of replicas for their data writes based on their own needs. This helps 
prevent stability issues caused by having only a single replica for the job.
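A quorum write can be sketched as follows: the client writes to all n replicas and succeeds once the configured write quorum acknowledges. This is a simplified illustration of the quorum idea, not Uniffle's actual protocol or defaults:

```python
class Replica:
    def __init__(self, up=True):
        self.up = up
        self.store = {}

    def put(self, key, data):
        if not self.up:
            raise IOError("replica unavailable")
        self.store[key] = data

def quorum_write(replicas, key, data, write_quorum):
    # count acknowledgements; tolerate individual replica failures
    acks = 0
    for r in replicas:
        try:
            r.put(key, data)
            acks += 1
        except IOError:
            pass
    return acks >= write_quorum

replicas = [Replica(), Replica(up=False), Replica()]  # n = 3, one replica down
ok = quorum_write(replicas, "b1", b"data", write_quorum=2)
```

With one of three replicas down, a write quorum of 2 still succeeds, while requiring all 3 would fail; jobs can pick the trade-off that suits them.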
+
+### 4) Stage Recomputation
+Currently, Spark supports recomputing the entire stage if there is a failure 
in reading from the Shuffle Server, helping the job recover and resume its 
execution.
+
+### 5) Quota Management
+When the cluster reaches its capacity limit or a job reaches its user's quota, the Coordinator can make the job fall back to native Spark shuffle. This prevents excessive cluster pressure and protects cluster stability from a single user erroneously submitting a large number of jobs.
+
+### 6) Coordinator HA
+Uniffle does not use Zookeeper, Etcd, or Raft-based solutions for high availability (HA), mainly to avoid the complexity these consensus systems introduce. The Coordinator is stateless and does not persist any state: all state information is reported by Shuffle Servers through heartbeats, so there is no need to elect a master. Deploying multiple Coordinator instances is enough to ensure high availability of the service.
+
+## Cost Effective
+### 1) Low-Cost Remote Storage
+In general, for a relatively stable business, computational resources tend to 
remain stable while storage resources grow linearly. These storage resources 
store a large amount of cold data. Uniffle supports hybrid storage, which 
allows the utilization of these unused storage resources, thereby reducing the 
overall system cost.
+
+### 2) Automatic Scaling
+Uniffle has developed a K8S Operator that implements scaling operations for 
stateful services using webhooks. By leveraging Horizontal Pod Autoscaler 
(HPA), automatic scaling can be achieved, further reducing system costs.
+
+## Community Engagement
+Currently, Uniffle supports multiple computational frameworks such as Spark, 
MapReduce, and Tez. Uniffle Spark has been adopted by companies like Tencent, 
Didi, iQiyi, SF Express, and Vipshop, handling PB-level data on a daily basis. 
Uniffle MapReduce is employed in mixed deployment scenarios by companies like 
Bilibili and Zhihu. Uniffle Tez has been jointly developed by HuoLaLa, Beike, 
and Shein, and is expected to be officially deployed in Q3 2023. The 
development of many important fea [...]
+
+
+## Future Plans
+### Storage Optimization
+1. Integration with object storage to optimize system costs.
+2. Merging index files and data files to further reduce IO overhead.
+3. Support for heterogeneous storage resources such as SSD and HDD.
+4. Support for sorting data by key.
+
+### Computation Optimization
+1. Support for dynamic allocation of Shuffle Servers.
+2. Support for the Slow Start feature in more engines.
+3. Continuous optimizations for Spark AQE.
+4. Support for the Flink engine.
+5. Asynchronous data reading support for compute engines.
+
+## Summary
+Uniffle has been designed with a focus on performance, correctness, stability, 
and cost-effectiveness, making it a suitable Shuffle system for cloud-native 
architectures. We welcome everyone to contribute to the Uniffle project. The 
Uniffle project can be found at https://github.com/apache/incubator-uniffle.
+
+## Reference
+[1] https://cloud.tencent.com/developer/article/1903023
+
+[2] https://cloud.tencent.com/developer/article/1943179
+
+[3] https://github.com/apache/incubator-uniffle/pull/137
+
+[4] https://github.com/apache/incubator-uniffle/pull/307
+
+[5] https://github.com/apache/incubator-uniffle/pull/950
diff --git a/blog/2023-07-21/img/aqe.png b/blog/2023-07-21/img/aqe.png
new file mode 100644
index 0000000..bc96bdf
Binary files /dev/null and b/blog/2023-07-21/img/aqe.png differ
diff --git a/blog/2023-07-21/img/architecture.png 
b/blog/2023-07-21/img/architecture.png
new file mode 100644
index 0000000..e3047c5
Binary files /dev/null and b/blog/2023-07-21/img/architecture.png differ
diff --git a/blog/2023-07-21/img/data_read.png 
b/blog/2023-07-21/img/data_read.png
new file mode 100644
index 0000000..ecfffc4
Binary files /dev/null and b/blog/2023-07-21/img/data_read.png differ
diff --git a/blog/2023-07-21/img/get_results.png 
b/blog/2023-07-21/img/get_results.png
new file mode 100644
index 0000000..220f0b6
Binary files /dev/null and b/blog/2023-07-21/img/get_results.png differ
diff --git a/blog/2023-07-21/img/gluten.png b/blog/2023-07-21/img/gluten.png
new file mode 100644
index 0000000..09b1c20
Binary files /dev/null and b/blog/2023-07-21/img/gluten.png differ
diff --git a/blog/2023-07-21/img/hdfs_fallback.png 
b/blog/2023-07-21/img/hdfs_fallback.png
new file mode 100644
index 0000000..2122fc3
Binary files /dev/null and b/blog/2023-07-21/img/hdfs_fallback.png differ
diff --git a/blog/2023-07-21/img/io_random.png 
b/blog/2023-07-21/img/io_random.png
new file mode 100644
index 0000000..753abe4
Binary files /dev/null and b/blog/2023-07-21/img/io_random.png differ
diff --git a/blog/2023-07-21/img/io_sort.png b/blog/2023-07-21/img/io_sort.png
new file mode 100644
index 0000000..208d6cc
Binary files /dev/null and b/blog/2023-07-21/img/io_sort.png differ
diff --git a/blog/2023-07-21/img/metadata.png b/blog/2023-07-21/img/metadata.png
new file mode 100644
index 0000000..4935a1c
Binary files /dev/null and b/blog/2023-07-21/img/metadata.png differ
diff --git a/blog/2023-07-21/img/process.png b/blog/2023-07-21/img/process.png
new file mode 100644
index 0000000..df70a3d
Binary files /dev/null and b/blog/2023-07-21/img/process.png differ
diff --git a/blog/2023-07-21/img/select.png b/blog/2023-07-21/img/select.png
new file mode 100644
index 0000000..93a4350
Binary files /dev/null and b/blog/2023-07-21/img/select.png differ
