This is an automated email from the ASF dual-hosted git repository.
roryqi pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-uniffle-website.git
The following commit(s) were added to refs/heads/master by this push:
new 15389c0 Polish the blog (#48)
15389c0 is described below
commit 15389c0dcdfdeffdf304466dbd7c801102a7c02d
Author: roryqi <[email protected]>
AuthorDate: Sun Jul 23 12:13:47 2023 +0800
Polish the blog (#48)
---
...pter for the shuffle in the cloud native era.md | 53 ++++++++++++++--------
1 file changed, 33 insertions(+), 20 deletions(-)
diff --git a/blog/2023-07-21/Uniffle: New chapter for the shuffle in the cloud
native era.md b/blog/2023-07-21/Uniffle: New chapter for the shuffle in the
cloud native era.md
index e99f972..d3b6a73 100644
--- a/blog/2023-07-21/Uniffle: New chapter for the shuffle in the cloud native
era.md
+++ b/blog/2023-07-21/Uniffle: New chapter for the shuffle in the cloud native
era.md
@@ -21,30 +21,29 @@
Shuffle is the process in distributed computing frameworks used to
redistribute data between upstream and downstream tasks. It is a crucial
component within computing frameworks and directly impacts their performance
and stability.
However, with the exploration of cloud-native architectures, traditional
Shuffle solutions have revealed various issues.
-In a cloud-native architecture, techniques such as storage-compute separation
and mixed deployment are also applied simultaneously.The computational nodes
have relatively small disk capacities, poor IO performance, and an imbalance
between CPU and IO resources.
-Additionally, computational nodes may be preempted by high-priority jobs due
to mixed deployments.
+In a cloud-native architecture, techniques such as storage-compute separation
and mixed deployment are applied. The computational nodes have relatively small
disk capacity, poor IO performance, and an imbalance between CPU and IO
resources.
+Additionally, computational nodes could be preempted by high-priority jobs due
to mixed deployments.
-In traditional Shuffle implementations, Shuffle nodes are tightly coupled with
computational nodes. However, due to the different resource requirements for
disk, memory, CPU, and node stability between computational and Shuffle nodes,
it is challenging to independently scale them based on their resource needs.
-By separating the computational nodes from Shuffle nodes, the computational
node's state becomes more lightweight after offloading the Shuffle state to
Shuffle nodes, reducing the need for job recomputation when computational nodes
are preempted.
+In traditional Shuffle implementations, shuffle nodes are tightly coupled
with computational nodes. However, due to the different resource requirements
for disk, memory, and CPU between computational nodes and shuffle nodes, it is
challenging to independently scale them based on their resource needs.
+By separating the computational nodes from shuffle nodes, the computational
node's state becomes more lightweight after offloading the Shuffle state to
shuffle nodes, reducing the need for job recomputation when computational
nodes are preempted.
Decoupling computational and Shuffle nodes also reduces the demand for disk
specifications on computational nodes, enabling an increase in the number of
accessible computational nodes.
In cloud-native architectures, large Shuffle jobs can exert significant
pressure on local disk drives, leading to issues such as insufficient disk
capacity on computational nodes and higher disk random IO, thus affecting the
performance and stability of large Shuffle jobs.
-
The industry has explored various new Shuffle technologies, including Google's
BigQuery, Baidu DCE Shuffle, Facebook's Cosco Shuffle, Uber Zeus Shuffle,
Alibaba's Celeborn Shuffle, and many others.
-Each system has made its own trade-offs based on different scenarios. Uniffle
aims to create a fast, accurate, stable, and cost-efficient cloud-native Remote
Shuffle Service, considering performance, correctness, stability, and cost as
its core aspects.
+Each system has made its own trade-offs based on different scenarios. Uniffle
aims to create a fast, accurate, stable and cost-efficient cloud-native Remote
Shuffle Service, considering performance, correctness, stability, and cost as
its core aspects.
## Architecture

### Coordinator
-The Coordinator is responsible for managing the entire cluster, and the
Shuffle Server reports the cluster's load situation to the Coordinator through
heartbeats. Based on the cluster's load, the Coordinator assigns suitable
Shuffle Servers for jobs. To facilitate operations and maintenance, the
Coordinator supports configuration deployment and provides a RESTful API for
external access.
+The Coordinator is responsible for managing the entire cluster, and the
Shuffle Server reports the cluster's load situation to the Coordinator through
heartbeats. Based on the cluster's load, the Coordinator assigns suitable
Shuffle Servers for jobs. To facilitate operations and maintenance, the
Coordinator supports configuration deployment and provides a RESTful API for
external access.
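The load-based assignment described above can be sketched roughly as follows. This is a minimal illustration, not Uniffle's actual API: the class, method, and field names here are all hypothetical, and the load metric is simplified to a single number reported per heartbeat.

```python
# Hypothetical sketch (not Uniffle's actual implementation): a coordinator
# that tracks per-server load from heartbeats and assigns the least-loaded
# Shuffle Servers to a job.
import heapq

class Coordinator:
    def __init__(self):
        self.load = {}  # server id -> last reported load (0.0 - 1.0)

    def heartbeat(self, server_id, load):
        # Shuffle Servers report their load periodically via heartbeats.
        self.load[server_id] = load

    def assign(self, num_servers):
        # Pick the servers with the lowest reported load for the job.
        return heapq.nsmallest(num_servers, self.load, key=self.load.get)

coord = Coordinator()
coord.heartbeat("server-a", 0.9)
coord.heartbeat("server-b", 0.2)
coord.heartbeat("server-c", 0.5)
# coord.assign(2) returns the two least-loaded servers: server-b, server-c
```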
### Shuffle Server
-Shuffle Server is primarily responsible for receiving Shuffle data,
aggregating it, and writing it into storage. For Shuffle data stored in local
disks, Shuffle Server provides the ability to read the data.
+Shuffle Server is primarily responsible for receiving, aggregating, and
writing Shuffle data into storage. For Shuffle data stored in local disks,
Shuffle Server provides the ability to read the data.
### Client
-The Client is responsible for communicating with the Coordinator and Shuffle
Server. It handles tasks such as requesting Shuffle Servers, sending
heartbeats, and performing read and write operations on Shuffle data. It
provides an SDK for Spark, MapReduce, and Tez to use.
+The Client is responsible for communicating with the Coordinator and Shuffle
Server. It handles tasks such as requesting Shuffle Servers, sending
heartbeats, and performing read and write operations on Shuffle data. It
provides an SDK for Spark, MapReduce and Tez to use.
## Read & Write process
@@ -61,21 +60,21 @@ The Client is responsible for communicating with the
Coordinator and Shuffle Ser
## Performance
### 1) Hybrid storage
-In our internal production environment, there are Partition data blocks at the
KB level that account for more than 80% of the total. In order to effectively
address the random IO issues caused by these small partitions, Uniffle
incorporates the concept of in-memory Shuffle, taking reference from Google's
Dremel. Additionally, considering that 80% of the data capacity in the live
network is due to large partitions, Uniffle introduces disk and HDFS as storage
media to address the data capa [...]
+In our internal production environment, there are Partition data blocks at the
KB level which account for more than 80% of the total. In order to effectively
address the random IO issues caused by these small partitions, Uniffle
incorporates the concept of in-memory Shuffle, taking reference from Google's
Dremel. Additionally, considering that 80% of the data capacity in our
production environment is due to large partitions, Uniffle introduces disk and
HDFS as storage media to addres [...]
### 2) Random IO Optimization

-The essence of random IO is the existence of a large number of small data
block operations. In order to avoid these operations, Uniffle first aggregates
multiple MapTasks' identical partitions in the memory of the Shuffle Server to
generate larger partition data. When the Shuffle data in memory reaches the
partition threshold or the overall threshold, it is written into local or
remote storage.
+The essence of random IO is the existence of numerous small data block
operations. In order to avoid these operations, Uniffle first aggregates
multiple MapTasks' identical partitions in the memory of the Shuffle Server to
generate larger partition data. When the Shuffle data in memory reaches the
partition threshold or the overall threshold, it is written into local or
remote storage.

-When the overall threshold of memory is reached, the Partition data in memory
is sorted based on their size. The larger Partitions are written to the storage
media first. Additionally, when the data in memory drops to a certain level,
the writing of Shuffle data to the storage media is stopped, further reducing
random IO on the disk.
+When the overall memory threshold is reached, Uniffle sorts the partition
data in memory by size and writes the larger partitions to the storage media
first. Additionally, when the data in memory drops to a certain size, the
writing of Shuffle data to the storage media is stopped, letting some data
stay in memory to further reduce random IO on the disk.
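The aggregate-then-flush behavior in the two paragraphs above can be sketched as follows. This is an illustrative sketch only, assuming a simple high/low watermark pair; the class name, field names, and threshold handling are hypothetical and do not reflect Uniffle's actual buffer implementation.

```python
# Hypothetical sketch: identical partitions from multiple MapTasks aggregate
# in memory; when the overall threshold is hit, the largest partitions are
# flushed first, and flushing stops once usage falls below a low watermark
# so small partitions can keep aggregating in memory.
class ShuffleBuffer:
    def __init__(self, high_watermark, low_watermark):
        self.high = high_watermark
        self.low = low_watermark
        self.partitions = {}  # partition id -> buffered bytes
        self.flushed = []     # (partition id, bytes) in flush order

    def used(self):
        return sum(self.partitions.values())

    def add_block(self, partition, nbytes):
        # Blocks for the same partition from different MapTasks aggregate here.
        self.partitions[partition] = self.partitions.get(partition, 0) + nbytes
        if self.used() >= self.high:
            self.flush()

    def flush(self):
        # Flush the largest partitions first; stop at the low watermark.
        for pid in sorted(self.partitions, key=self.partitions.get, reverse=True):
            if self.used() <= self.low:
                break
            self.flushed.append((pid, self.partitions.pop(pid)))

buf = ShuffleBuffer(high_watermark=100, low_watermark=40)
for pid, size in [(0, 50), (1, 10), (2, 45)]:
    buf.add_block(pid, size)
# Partitions 0 and 2 (the largest) are flushed; partition 1 stays in memory.
```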
### 3) Storage media selection strategy

-For writing Shuffle data to local or remote storage, Uniffle has observed
through testing that larger data block sizes result in better write performance
for remote storage. When the data block size exceeds 64MB, the write
performance to remote storage can reach 32MB/s. Therefore, when writing data to
storage media, Uniffle selects to write larger data blocks to remote storage
based on their size, while smaller data blocks are written to local storage.
+For writing Shuffle data to local or remote storage, Uniffle has observed
through testing that larger data block sizes result in better write performance
for remote storage. When the data block size exceeds 64MB, the write
performance to remote storage can reach 32MB/s. This is a sufficient write
speed if multiple threads are used, compared to the 100MB/s write speed of an
HDD. Therefore, when writing data to storage media, Uniffle selects to write
larger data blocks to remote storage bas [...]
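The size-based routing described above amounts to a simple threshold check, sketched below. The function and constant names are illustrative assumptions; only the 64MB threshold comes from the text.

```python
# Hypothetical sketch of the storage selection strategy: data blocks at or
# above a size threshold (64MB in the blog's measurements) are routed to
# remote storage (HDFS), smaller blocks to local disk.
REMOTE_THRESHOLD = 64 * 1024 * 1024  # 64MB

def select_storage(block_size):
    # Large blocks write efficiently to remote storage; small blocks would
    # cause random IO there, so they go to local disk instead.
    return "HDFS" if block_size >= REMOTE_THRESHOLD else "LOCAL_DISK"

# A 128MB block goes remote; a 512KB block stays local.
```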
### 4) Write concurrency
-For larger partitions, it is challenging to meet the performance requirements
of writing to remote storage with a single thread. In HDFS, a file can only be
written by one writer. To address this limitation, Uniffle allows multiple
files to be mapped to a single partition for remote storage. Uniffle utilizes
multi-threading to increase the write performance of large partitions. However,
it's important to note that a single partition occupying all remote storage
threads can affect the wri [...]
+For larger partitions, it is challenging to meet the performance requirements
of writing to remote storage with a single thread. In HDFS, a file can only be
written by one writer. To address this limitation, Uniffle allows multiple
files to be mapped to a single partition for remote storage. Uniffle utilizes
multi-threading to increase the writing performance of large partitions.
However, it's important to note that a single partition occupying all remote
storage threads can affect the w [...]
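The multi-file write scheme above can be sketched as follows. Because an HDFS file has a single writer, a large partition is split across several files, each owned by one thread; the file naming, the per-partition writer cap, and the round-robin block split are all illustrative assumptions, not Uniffle's actual implementation.

```python
# Hypothetical sketch of write concurrency: one logical partition maps to
# several files on remote storage, each written by its own thread, with a
# cap so a single partition cannot occupy every remote-storage thread.
from concurrent.futures import ThreadPoolExecutor

def write_partition(partition_id, blocks, max_writers=4):
    writers = min(max_writers, len(blocks))
    files = {f"partition-{partition_id}.part{i}": [] for i in range(writers)}
    names = list(files)

    def write(i):
        # Each thread appends its round-robin share of blocks to its own file,
        # respecting HDFS's one-writer-per-file constraint.
        for block in blocks[i::writers]:
            files[names[i]].append(block)

    with ThreadPoolExecutor(max_workers=writers) as pool:
        list(pool.map(write, range(writers)))
    return files

files = write_partition(7, [b"a", b"b", b"c", b"d", b"e"], max_writers=2)
# Five blocks are spread across two files mapped to the same partition.
```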
### 5) Data Distribution

@@ -84,13 +83,14 @@ For computational engines like Spark AQE (Adaptive Query
Execution), there are s
In scenarios where multiple partitions need to be read, an optimization
technique in Uniffle involves allocating the task of reading multiple
partitions to a single ShuffleServer. This allows for the aggregation of Rpc
(Remote Procedure Call) requests, which means that multiple Rpc requests can be
sent to a single Shuffle Server. This approach helps to minimize network
overhead and improve overall performance. For more detailed information, please
refer to [4].
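The Rpc aggregation described above boils down to grouping partition reads by the server that holds them, as in this sketch. The function name and the partition-to-server mapping shape are hypothetical illustrations.

```python
# Hypothetical sketch of RPC aggregation for multi-partition reads: instead
# of one request per partition, partitions served by the same Shuffle Server
# are grouped so a single request fetches all of them.
from collections import defaultdict

def aggregate_requests(partition_to_server):
    # One request per server, carrying every partition that server holds.
    requests = defaultdict(list)
    for partition, server in sorted(partition_to_server.items()):
        requests[server].append(partition)
    return dict(requests)

reqs = aggregate_requests({0: "s1", 1: "s1", 2: "s2", 3: "s1"})
# Three partition reads on s1 collapse into a single RPC.
```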
### 6) Off-heap memory management
-In the data communication process of Uniffle, Grpc is used, and there are
multiple memory copying processes in the Grpc code implementation.
Additionally, the Shuffle Server currently uses heap memory for management.
When using an 80GB memory Shuffle Server in a production environment, it may
experience a significant amount of garbage collection (GC), with individual GC
pauses lasting approximately 22 seconds. To address this issue, Uniffle
upgraded the JDK to version 11. On the data tra [...]
+In the data communication process of Uniffle, Grpc is used, and there are
multiple memory copying processes in the Grpc code implementation.
Additionally, the Shuffle Server currently uses heap memory for management.
When using an 80GB memory Shuffle Server in a production environment, it may
experience a significant amount of garbage collection (GC), with individual GC
pauses lasting approximately 22 seconds. To address this issue, Uniffle
upgraded the JDK to version 11. On the data tra [...]
### 7) Columnar Shuffle Format
The Uniffle framework itself does not natively support columnar Shuffle. To
leverage the columnar Shuffle capabilities, Uniffle integrates with Gluten, a
columnar shuffle component. By integrating with Gluten, Uniffle is able to
reuse the columnar Shuffle capabilities provided by Gluten. For more detailed
information, please refer to [5].

-### 8) Remove barrier execution
-For batch processing in distributed computing frameworks, the commonly used
model is the Bulk Synchronous Parallel (BSP) model. In this model, downstream
tasks are only started after all upstream tasks have completed. However, to
reduce the impact of straggler nodes on job performance, some computing
frameworks support a "slow start" mechanism to allow upstream and downstream
tasks to run concurrently. On the other hand, for stream processing and OLAP
engines, a pipeline model is used wh [...]
+
+### 8) Barrier-Free
+For batch processing in distributed computing frameworks, the commonly used
model is the Bulk Synchronous Parallel (BSP) model. In this model, downstream
tasks are started only after all upstream tasks have completed. However, to
reduce the impact of straggler nodes on job performance, some computing
frameworks support a "slow start" mechanism to allow upstream and downstream
tasks to run concurrently. On the other hand, for stream processing and OLAP
engines, a pipeline model is used where upstr [...]
To cater to various computing frameworks, Uniffle employs a barrier-free
design that allows upstream and downstream stages to run concurrently. The key
to achieving this barrier-free execution lies in supporting efficient in-memory
read/write operations and an effective index filtering mechanism. With this
design, job execution does not require a request to the Shuffle Server for
writing all data to storage media at the end of each stage. Additionally, since
upstream and downstream stage [...]
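The index filtering that underpins barrier-free execution can be sketched as below. This is a loose illustration under stated assumptions: a plain Python set stands in for the bitmap index, and the block layout and field names are hypothetical, not Uniffle's actual data structures.

```python
# Hypothetical sketch of index filtering: a reader keeps the set of block
# ids committed by successful task attempts and drops any block whose id
# is not in that set (e.g., blocks from failed or speculative attempts
# that may arrive while stages run concurrently).
def filter_blocks(blocks, committed_ids):
    # Keep only blocks whose id was committed by a successful attempt.
    return [b for b in blocks if b["block_id"] in committed_ids]

blocks = [
    {"block_id": 1, "data": b"x"},
    {"block_id": 2, "data": b"y"},  # from a failed attempt, never committed
    {"block_id": 3, "data": b"z"},
]
valid = filter_blocks(blocks, committed_ids={1, 3})
# Only the committed blocks 1 and 3 survive the filter.
```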
@@ -98,7 +98,7 @@ Uniffle has designed both bitmap index filtering and file
index filtering mechan
### Performance evaluation
-When using version 0.2 of Uniffle and conducting benchmarks, it was found that
Uniffle's shuffle performance is comparable to Spark's native shuffle for small
data volumes. However, for large data volumes, Uniffle's shuffle outperforms
Spark's native shuffle by up to 30%. The benchmark results can be found at the
following link:
https://github.com/apache/incubator-uniffle/blob/master/docs/benchmark.md
+When using version 0.2 of Uniffle and conducting benchmarks, Uniffle's shuffle
performance is similar to Spark's vanilla shuffle for small data volumes.
However, for large data volumes, Uniffle's shuffle outperforms Spark's vanilla
shuffle by up to 30%. The benchmark results can be found at the following link:
https://github.com/apache/incubator-uniffle/blob/master/docs/benchmark.md
## Correctness
@@ -126,7 +126,7 @@ Uniffle adopts the Quorum replica protocol, allowing jobs
to configure the numbe
Currently, Spark supports recomputing the entire stage if there is a failure
in reading from the Shuffle Server, helping the job recover and resume its
execution.
### 5) Quota Management
-When the cluster capacity reaches its limit or a job reaches the user's quota
limit, the Coordinator can make the job fallback to the native Spark mode. This
prevents excessive cluster pressure or the stability of the cluster being
affected by a single user's erroneous submission of a large number of jobs.
+When a job reaches the user's quota limit, the Coordinator can make the job
fall back to the vanilla Spark mode. This prevents cluster stability from being
affected by a single user's erroneous submission of numerous jobs.
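The fallback decision above can be sketched as a simple admission check on the Coordinator. The class, the per-user job-count quota, and the decision strings are illustrative assumptions, not Uniffle's actual quota model.

```python
# Hypothetical sketch of quota management: when a user exceeds their quota,
# the Coordinator tells the job to fall back to vanilla Spark shuffle
# instead of admitting it to the Uniffle cluster.
class QuotaManager:
    def __init__(self, quotas):
        self.quotas = quotas  # user -> max concurrent Uniffle jobs
        self.running = {}     # user -> currently admitted job count

    def admit(self, user):
        used = self.running.get(user, 0)
        if used >= self.quotas.get(user, 0):
            # Over quota: the job still runs, just with vanilla Spark shuffle.
            return "FALLBACK_TO_VANILLA_SPARK"
        self.running[user] = used + 1
        return "USE_UNIFFLE"

qm = QuotaManager({"alice": 2})
decisions = [qm.admit("alice") for _ in range(3)]
# The first two jobs use Uniffle; the third falls back to vanilla Spark.
```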
### 6) Coordinator HA
Uniffle does not choose solutions like Zookeeper, Etcd, or Raft for high
availability (HA) purposes, mainly considering the complexity introduced by
these consensus protocol systems. For Uniffle, the Coordinator is stateless and
does not persist any state. All state information is reported by the Shuffle
Server through heartbeats, so there is no need to determine which node is the
master. Deploying multiple Coordinator instances ensures high availability of
the service.
@@ -139,7 +139,20 @@ In general, for a relatively stable business,
computational resources tend to re
Uniffle has developed a K8S Operator that implements scaling operations for
stateful services using webhooks. By leveraging Horizontal Pod Autoscaler
(HPA), automatic scaling can be achieved, further reducing system costs.
## Community Engagement
-Currently, Uniffle supports multiple computational frameworks such as Spark,
MapReduce, and Tez. Uniffle Spark has been adopted by companies like Tencent,
Didi, iQiyi, SF Express, and Vipshop, handling PB-level data on a daily basis.
Uniffle MapReduce is employed in mixed deployment scenarios by companies like
Bilibili and Zhihu. Uniffle Tez has been jointly developed by HuoLaLa, Beike,
and Shein, and is expected to be officially deployed in Q3 2023. The
development of many important fea [...]
+Currently, Uniffle supports multiple computational frameworks such as Spark,
MapReduce, and Tez.
+
+Uniffle Spark has been adopted by companies like Tencent, Didi, iQiyi, SF
Express, and Vipshop, handling PB-level data on a daily basis.
+
+Uniffle MapReduce is employed in mixed deployment scenarios by companies like
Bilibili and Zhihu.
+
+Uniffle Tez has been jointly developed by HuoLaLa, Beike, and Shein.
+
+The development of many important features in the community has involved
contributions from well-known Chinese internet companies.
+For example, iQiyi has contributed support for accessing Kerberos HDFS
clusters and has optimized the performance of Spark AQE on Uniffle. Didi has
added support for multi-tenant job quotas. Netty data plane optimizations were
jointly completed by Didi and Vipshop. The support for Gluten was contributed
by Baidu and SF Express.
+
+Currently, the community has more than 50 contributors, with over 600 commits,
and has released four Apache versions. It is being used by dozens of companies.
Additionally, teams and companies interested in contributing to Uniffle Flink
can contact the Uniffle community at mailto:[email protected].
+
+Currently, there are no companies participating in the community with
deployment scenarios or development plans for Uniffle Flink. Your help in
filling this gap in the community would be greatly appreciated. Uniffle's
design incorporates a large number of mechanisms and strategies, and users are
welcome to contribute strategies that suit their own scenarios.
## Future Plans