This is an automated email from the ASF dual-hosted git repository.
roryqi pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-uniffle-website.git
The following commit(s) were added to refs/heads/master by this push:
new 15389c0 Polish the blog (#48)
15389c0 is described below
commit 15389c0dcdfdeffdf304466dbd7c801102a7c02d
Author: roryqi <[email protected]>
AuthorDate: Sun Jul 23 12:13:47 2023 +0800
Polish the blog (#48)
---
...pter for the shuffle in the cloud native era.md | 53 ++++++++++++++--------
1 file changed, 33 insertions(+), 20 deletions(-)
diff --git a/blog/2023-07-21/Uniffle: New chapter for the shuffle in the cloud
native era.md b/blog/2023-07-21/Uniffle: New chapter for the shuffle in the
cloud native era.md
index e99f972..d3b6a73 100644
--- a/blog/2023-07-21/Uniffle: New chapter for the shuffle in the cloud native
era.md
+++ b/blog/2023-07-21/Uniffle: New chapter for the shuffle in the cloud native
era.md
@@ -21,30 +21,29 @@
Shuffle is the process in distributed computing frameworks used to
redistribute data between upstream and downstream tasks. It is a crucial
component within computing frameworks and directly impacts their performance
and stability.
However, with the exploration of cloud-native architectures, traditional
Shuffle solutions have revealed various issues.
-In a cloud-native architecture, techniques such as storage-compute separation
and mixed deployment are also applied simultaneously.The computational nodes
have relatively small disk capacities, poor IO performance, and an imbalance
between CPU and IO resources.
-Additionally, computational nodes may be preempted by high-priority jobs due
to mixed deployments.
+In a cloud-native architecture, techniques such as storage-compute separation
and mixed deployment are applied. The computational nodes have relatively small
disk capacity, poor IO performance, and an imbalance between CPU and IO
resources.
+Additionally, computational nodes could be preempted by high-priority jobs due
to mixed deployments.
-In traditional Shuffle implementations, Shuffle nodes are tightly coupled with
computational nodes. However, due to the different resource requirements for
disk, memory, CPU, and node stability between computational and Shuffle nodes,
it is challenging to independently scale them based on their resource needs.
-By separating the computational nodes from Shuffle nodes, the computational
node's state becomes more lightweight after offloading the Shuffle state to
Shuffle nodes, reducing the need for job recomputation when computational nodes
are preempted.
+In traditional Shuffle implementations, shuffle nodes are tightly coupled
with computational nodes. However, due to the different resource requirements
for disk, memory, and CPU between computational nodes and shuffle nodes, it is
challenging to independently scale them based on their resource needs.
+By separating the computational nodes from shuffle nodes, the computational
node's state becomes more lightweight after offloading the Shuffle state to
shuffle nodes, reducing the need for job recomputation when computational
nodes are preempted.
Decoupling computational and Shuffle nodes also reduces the demand for disk
specifications on computational nodes, enabling an increase in the number of
accessible computational nodes.
In cloud-native architectures, large Shuffle jobs can exert significant
pressure on local disk drives, leading to issues such as insufficient disk
capacity on computational nodes and higher disk random IO, thus affecting the
performance and stability of large Shuffle jobs.
-
The industry has explored various new Shuffle technologies, including Google's
BigQuery, Baidu DCE Shuffle, Facebook's Cosco Shuffle, Uber Zeus Shuffle,
Alibaba's Celeborn Shuffle, and many others.
-Each system has made its own trade-offs based on different scenarios. Uniffle
aims to create a fast, accurate, stable, and cost-efficient cloud-native Remote
Shuffle Service, considering performance, correctness, stability, and cost as
its core aspects.
+Each system has made its own trade-offs based on different scenarios. Uniffle
aims to create a fast, accurate, stable and cost-efficient cloud-native Remote
Shuffle Service, considering performance, correctness, stability, and cost as
its core aspects.
## Architecture

### Coordinator
-The Coordinator is responsible for managing the entire cluster, and the
Shuffle Server reports the cluster's load situation to the Coordinator through
heartbeats. Based on the cluster's load, the Coordinator assigns suitable
Shuffle Servers for jobs. To facilitate operations and maintenance, the
Coordinator supports configuration deployment and provides a RESTful API for
external access.
+The Coordinator is responsible for managing the entire cluster, and the
Shuffle Server reports the cluster's load situation to the Coordinator through
heartbeats. Based on the cluster's load, the Coordinator assigns suitable
Shuffle Servers for jobs. To facilitate operations and maintenance, the
Coordinator supports configuration deployment and provides a RESTful API for
external access.
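The load-based assignment described above can be sketched roughly as follows. This is a minimal illustration, not Uniffle's actual API: the class, method, and field names here are all hypothetical, and the load metric is simplified to a single number reported per heartbeat.

```python
# Hypothetical sketch (not Uniffle's actual implementation): a coordinator
# that tracks per-server load from heartbeats and assigns the least-loaded
# Shuffle Servers to a job.
import heapq

class Coordinator:
    def __init__(self):
        self.load = {}  # server id -> last reported load (0.0 - 1.0)

    def heartbeat(self, server_id, load):
        # Shuffle Servers report their load periodically via heartbeats.
        self.load[server_id] = load

    def assign(self, num_servers):
        # Pick the servers with the lowest reported load for the job.
        return heapq.nsmallest(num_servers, self.load, key=self.load.get)

coord = Coordinator()
coord.heartbeat("server-a", 0.9)
coord.heartbeat("server-b", 0.2)
coord.heartbeat("server-c", 0.5)
# coord.assign(2) returns the two least-loaded servers: server-b, server-c
```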
### Shuffle Server
-Shuffle Server is primarily responsible for receiving Shuffle data,
aggregating it, and writing it into storage. For Shuffle data stored in local
disks, Shuffle Server provides the ability to read the data.
+Shuffle Server is primarily responsible for receiving, aggregating, and
writing Shuffle data into storage. For Shuffle data stored in local disks,
Shuffle Server provides the ability to read the data.
### Client
-The Client is responsible for communicating with the Coordinator and Shuffle
Server. It handles tasks such as requesting Shuffle Servers, sending
heartbeats, and performing read and write operations on Shuffle data. It
provides an SDK for Spark, MapReduce, and Tez to use.
+The Client is responsible for communicating with the Coordinator and Shuffle
Server. It handles tasks such as requesting Shuffle Servers, sending
heartbeats, and performing read and write operations on Shuffle data. It
provides an SDK for Spark, MapReduce and Tez to use.
## Read & Write process
@@ -61,21 +60,21 @@ The Client is responsible for communicating with the
Coordinator and Shuffle Ser
## Performance
### 1) Hybrid storage
-In our internal production environment, there are Partition data blocks at the
KB level that account for more than 80% of the total. In order to effectively
address the random IO issues caused by these small partitions, Uniffle
incorporates the concept of in-memory Shuffle, taking reference from Google's
Dremel. Additionally, considering that 80% of the data capacity in the live
network is due to large partitions, Uniffle introduces disk and HDFS as storage
media to address the data capa [...]
+In our internal production environment, there are Partition data blocks at the
KB level which account for more than 80% of the total. In order to effectively
address the random IO issues caused by these small partitions, Uniffle
incorporates the concept of in-memory Shuffle, taking reference from Google's
Dremel. Additionally, considering that 80% of the data capacity in our
production environment is due to large partitions, Uniffle introduces disk and
HDFS as storage media to addres [...]
### 2) Random IO Optimization

-The essence of random IO is the existence of a large number of small data
block operations. In order to avoid these operations, Uniffle first aggregates
multiple MapTasks' identical partitions in the memory of the Shuffle Server to
generate larger partition data. When the Shuffle data in memory reaches the
partition threshold or the overall threshold, it is written into local or
remote storage.
+The essence of random IO is the existence of numerous small data block
operations. In order to avoid these operations, Uniffle first aggregates
multiple MapTasks' identical partitions in the memory of the Shuffle Server to
generate larger partition data. When the Shuffle data in memory reaches the
partition threshold or the overall threshold, it is written into local or
remote storage.

-When the overall threshold of memory is reached, the Partition data in memory
is sorted based on their size. The larger Partitions are written to the storage
media first. Additionally, when the data in memory drops to a certain level,
the writing of Shuffle data to the storage media is stopped, further reducing
random IO on the disk.
+When the overall memory threshold is reached, Uniffle sorts the partition
data in memory by size and writes the larger partitions to the storage media
first. Additionally, when the data in memory drops to a certain size, the
writing of Shuffle data to the storage media is stopped, letting some data
stay in memory to further reduce random IO on the disk.
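The aggregate-then-flush behavior in the two paragraphs above can be sketched as follows. This is an illustrative sketch only, assuming a simple high/low watermark pair; the class name, field names, and threshold handling are hypothetical and do not reflect Uniffle's actual buffer implementation.

```python
# Hypothetical sketch: identical partitions from multiple MapTasks aggregate
# in memory; when the overall threshold is hit, the largest partitions are
# flushed first, and flushing stops once usage falls below a low watermark
# so small partitions can keep aggregating in memory.
class ShuffleBuffer:
    def __init__(self, high_watermark, low_watermark):
        self.high = high_watermark
        self.low = low_watermark
        self.partitions = {}  # partition id -> buffered bytes
        self.flushed = []     # (partition id, bytes) in flush order

    def used(self):
        return sum(self.partitions.values())

    def add_block(self, partition, nbytes):
        # Blocks for the same partition from different MapTasks aggregate here.
        self.partitions[partition] = self.partitions.get(partition, 0) + nbytes
        if self.used() >= self.high:
            self.flush()

    def flush(self):
        # Flush the largest partitions first; stop at the low watermark.
        for pid in sorted(self.partitions, key=self.partitions.get, reverse=True):
            if self.used() <= self.low:
                break
            self.flushed.append((pid, self.partitions.pop(pid)))

buf = ShuffleBuffer(high_watermark=100, low_watermark=40)
for pid, size in [(0, 50), (1, 10), (2, 45)]:
    buf.add_block(pid, size)
# Partitions 0 and 2 (the largest) are flushed; partition 1 stays in memory.
```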
### 3) Storage media selection strategy

-For writing Shuffle data to local or remote storage, Uniffle has observed
through testing that larger data block sizes result in better write performance
for remote storage. When the data block size exceeds 64MB, the write
performance to remote storage can reach 32MB/s. Therefore, when writing data to
storage media, Uniffle selects to write larger data blocks to remote storage
based on their size, while smaller data blocks are written to local storage.
+For writing Shuffle data to local or remote storage, Uniffle has observed
through testing that larger data block sizes result in better write performance
for remote storage. When the data block size exceeds 64MB, the write
performance to remote storage can reach 32MB/s. This is a sufficient write
speed if multiple threads are used, compared to the 100MB/s write speed of an
HDD. Therefore, when writing data to storage media, Uniffle selects to write
larger data blocks to remote storage bas [...]
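The size-based routing described above amounts to a simple threshold check, sketched below. The function and constant names are illustrative assumptions; only the 64MB threshold comes from the text.

```python
# Hypothetical sketch of the storage selection strategy: data blocks at or
# above a size threshold (64MB in the blog's measurements) are routed to
# remote storage (HDFS), smaller blocks to local disk.
REMOTE_THRESHOLD = 64 * 1024 * 1024  # 64MB

def select_storage(block_size):
    # Large blocks write efficiently to remote storage; small blocks would
    # cause random IO there, so they go to local disk instead.
    return "HDFS" if block_size >= REMOTE_THRESHOLD else "LOCAL_DISK"

# A 128MB block goes remote; a 512KB block stays local.
```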
### 4) Write concurrency
-For larger partitions, it is challenging to meet the performance requirements
of writing to remote storage with a single thread. In HDFS, a file can only be
written by one writer. To address this limitation, Uniffle allows multiple
files to be mapped to a single partition for remote storage. Uniffle utilizes
multi-threading to increase the write performance of large partitions. However,
it's important to note that a single partition occupying all remote storage
threads can affect the wri [...]
+For larger partitions, it is challenging to meet the performance requirements
of writing to remote storage with a single thread. In HDFS, a file can only be
written by one writer. To address this limitation, Uniffle allows multiple
files to be mapped to a single partition for remote storage. Uniffle utilizes
multi-threading to increase the writing performance of large partitions.
However, it's important to note that a single partition occupying all remote
storage threads can affect the w [...]
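The multi-file write scheme above can be sketched as follows. Because an HDFS file has a single writer, a large partition is split across several files, each owned by one thread; the file naming, the per-partition writer cap, and the round-robin block split are all illustrative assumptions, not Uniffle's actual implementation.

```python
# Hypothetical sketch of write concurrency: one logical partition maps to
# several files on remote storage, each written by its own thread, with a
# cap so a single partition cannot occupy every remote-storage thread.
from concurrent.futures import ThreadPoolExecutor

def write_partition(partition_id, blocks, max_writers=4):
    writers = min(max_writers, len(blocks))
    files = {f"partition-{partition_id}.part{i}": [] for i in range(writers)}
    names = list(files)

    def write(i):
        # Each thread appends its round-robin share of blocks to its own file,
        # respecting HDFS's one-writer-per-file constraint.
        for block in blocks[i::writers]:
            files[names[i]].append(block)

    with ThreadPoolExecutor(max_workers=writers) as pool:
        list(pool.map(write, range(writers)))
    return files

files = write_partition(7, [b"a", b"b", b"c", b"d", b"e"], max_writers=2)
# Five blocks are spread across two files mapped to the same partition.
```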
### 5) Data Distribution

@@ -84,13 +83,14 @@ For computational engines like Spark AQE (Adaptive Query
Execution), there are s
In scenarios where multiple partitions need to be read, an optimization
technique in Uniffle involves allocating the task of reading multiple
partitions to a single ShuffleServer. This allows for the aggregation of Rpc
(Remote Procedure Call) requests, which means that multiple Rpc requests can be
sent to a single Shuffle Server. This approach helps to minimize network
overhead and improve overall performance. For more detailed information, please
refer to [4].
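The Rpc aggregation described above boils down to grouping partition reads by the server that holds them, as in this sketch. The function name and the partition-to-server mapping shape are hypothetical illustrations.

```python
# Hypothetical sketch of RPC aggregation for multi-partition reads: instead
# of one request per partition, partitions served by the same Shuffle Server
# are grouped so a single request fetches all of them.
from collections import defaultdict

def aggregate_requests(partition_to_server):
    # One request per server, carrying every partition that server holds.
    requests = defaultdict(list)
    for partition, server in sorted(partition_to_server.items()):
        requests[server].append(partition)
    return dict(requests)

reqs = aggregate_requests({0: "s1", 1: "s1", 2: "s2", 3: "s1"})
# Three partition reads on s1 collapse into a single RPC.
```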
### 6) Off-heap memory management
-In the data communication process of Uniffle, Grpc is used, and there are
multiple memory copying processes in the Grpc code implementation.
Additionally, the Shuffle Server currently uses heap memory for management.
When using an 80GB memory Shuffle Server in a production environment, it may
experience a significant amount of garbage collection (GC), with individual GC
pauses lasting approximately 22 seconds. To address this issue, Uniffle
upgraded the JDK to version 11. On the data tra [...]
+In the data communication process of Uniffle, Grpc is used, and there are
multiple memory copying processes in the Grpc code implementation.
Additionally, the Shuffle Server currently uses heap memory for management.
When using an 80GB memory Shuffle Server in a production environment, it may
experience a significant amount of garbage collection (GC), with individual GC
pauses lasting approximately 22 seconds. To address this issue, Uniffle
upgraded the JDK to version 11. On the data tra [...]
### 7) Columnar Shuffle Format
The Uniffle framework itself does not natively support columnar Shuffle. To
leverage the columnar Shuffle capabilities, Uniffle integrates with Gluten, a
columnar shuffle component. By integrating with Gluten, Uniffle is able to
reuse the columnar Shuffle capabilities provided by Gluten. For more detailed
information, please refer to [5].

-### 8) Remove barrier execution
-For batch processing in distributed computing frameworks, the commonly used
model is the Bulk Synchronous Parallel (BSP) model. In this model, downstream
tasks are only started after all upstream tasks have completed. However, to
reduce the impact of straggler nodes on job performance, some computing
frameworks support a "slow start" mechanism to allow upstream and downstream
tasks to run concurrently. On the other hand, for stream processing and OLAP
engines, a pipeline model is used wh [...]
+
+### 8) Barrier-Free
+For batch processing in distributed computing frameworks, the commonly used
model is the Bulk Synchronous Parallel (BSP) model. In this model, downstream
tasks are started only after all upstream tasks have completed. However, to
reduce the impact of straggler nodes on job performance, some computing
frameworks support a "slow start" mechanism to allow upstream and downstream
tasks to run concurrently. On the other hand, for stream processing and OLAP
engines, a pipeline model is used where upstr [...]
To cater to various computing frameworks, Uniffle employs a barrier-free
design that allows upstream and downstream stages to run concurrently. The key
to achieving this barrier-free execution lies in supporting efficient in-memory
read/write operations and an effective index filtering mechanism. With this
design, job execution does not require a request to the Shuffle Server for
writing all data to storage media at the end of each stage. Additionally, since
upstream and downstream stage [...]
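The index filtering that underpins barrier-free execution can be sketched as below. This is a loose illustration under stated assumptions: a plain Python set stands in for the bitmap index, and the block layout and field names are hypothetical, not Uniffle's actual data structures.

```python
# Hypothetical sketch of index filtering: a reader keeps the set of block
# ids committed by successful task attempts and drops any block whose id
# is not in that set (e.g., blocks from failed or speculative attempts
# that may arrive while stages run concurrently).
def filter_blocks(blocks, committed_ids):
    # Keep only blocks whose id was committed by a successful attempt.
    return [b for b in blocks if b["block_id"] in committed_ids]

blocks = [
    {"block_id": 1, "data": b"x"},
    {"block_id": 2, "data": b"y"},  # from a failed attempt, never committed
    {"block_id": 3, "data": b"z"},
]
valid = filter_blocks(blocks, committed_ids={1, 3})
# Only the committed blocks 1 and 3 survive the filter.
```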
@@ -98,7 +98,7 @@ Uniffle has designed both bitmap index filtering and file
index filtering mechan
### Performance evaluation
-When using version 0.2 of Uniffle and conducting benchmarks, it was found that
Uniffle's shuffle performance is comparable to Spark's native shuffle for small
data volumes. However, for large data volumes, Uniffle's shuffle outperforms
Spark's native shuffle by up to 30%. The benchmark results can be found at the
following link:
https://github.com/apache/incubator-uniffle/blob/master/docs/benchmark.md
+When using version 0.2 of Uniffle and conducting benchmarks, Uniffle's shuffle
performance is similar to Spark's vanilla shuffle for small data volumes.
However, for large data volumes, Uniffle's shuffle outperforms Spark's vanilla
shuffle by up to 30%. The benchmark results can be found at the following link:
https://github.com/apache/incubator-uniffle/blob/master/docs/benchmark.md
## Correctness
@@ -126,7 +126,7 @@ Uniffle adopts the Quorum replica protocol, allowing jobs
to configure the numbe
Currently, Spark supports recomputing the entire stage if there is a failure
in reading from the Shuffle Server, helping the job recover and resume its
execution.
### 5) Quota Management
-When the cluster capacity reaches its limit or a job reaches the user's quota
limit, the Coordinator can make the job fallback to the native Spark mode. This
prevents excessive cluster pressure or the stability of the cluster being
affected by a single user's erroneous submission of a large number of jobs.
+When a job reaches the user's quota limit, the Coordinator can make the job
fall back to the vanilla Spark mode. This prevents cluster stability from being
affected by a single user's erroneous submission of numerous jobs.
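The fallback decision above can be sketched as a simple admission check on the Coordinator. The class, the per-user job-count quota, and the decision strings are illustrative assumptions, not Uniffle's actual quota model.

```python
# Hypothetical sketch of quota management: when a user exceeds their quota,
# the Coordinator tells the job to fall back to vanilla Spark shuffle
# instead of admitting it to the Uniffle cluster.
class QuotaManager:
    def __init__(self, quotas):
        self.quotas = quotas  # user -> max concurrent Uniffle jobs
        self.running = {}     # user -> currently admitted job count

    def admit(self, user):
        used = self.running.get(user, 0)
        if used >= self.quotas.get(user, 0):
            # Over quota: the job still runs, just with vanilla Spark shuffle.
            return "FALLBACK_TO_VANILLA_SPARK"
        self.running[user] = used + 1
        return "USE_UNIFFLE"

qm = QuotaManager({"alice": 2})
decisions = [qm.admit("alice") for _ in range(3)]
# The first two jobs use Uniffle; the third falls back to vanilla Spark.
```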
### 6) Coordinator HA
Uniffle does not choose solutions like Zookeeper, Etcd, or Raft for high
availability (HA) purposes, mainly considering the complexity introduced by
these consensus protocol systems. For Uniffle, the Coordinator is stateless and
does not persist any state. All state information is reported by the Shuffle
Server through heartbeats, so there is no need to determine which node is the
master. Deploying multiple Coordinator instances ensures high availability of
the service.
@@ -139,7 +139,20 @@ In general, for a relatively stable business,
computational resources tend to re
Uniffle has developed a K8S Operator that implements scaling operations for
stateful services using webhooks. By leveraging Horizontal Pod Autoscaler
(HPA), automatic scaling can be achieved, further reducing system costs.
## Community Engagement
-Currently, Uniffle supports multiple computational frameworks such as Spark,
MapReduce, and Tez. Uniffle Spark has been adopted by companies like Tencent,
Didi, iQiyi, SF Express, and Vipshop, handling PB-level data on a daily basis.
Uniffle MapReduce is employed in mixed deployment scenarios by companies like
Bilibili and Zhihu. Uniffle Tez has been jointly developed by HuoLaLa, Beike,
and Shein, and is expected to be officially deployed in Q3 2023. The
development of many important fea [...]
+Currently, Uniffle supports multiple computational frameworks such as Spark,
MapReduce, and Tez.
+
+Uniffle Spark has been adopted by companies like Tencent, Didi, iQiyi, SF
Express, and Vipshop, handling PB-level data on a daily basis.
+
+Uniffle MapReduce is employed in mixed deployment scenarios by companies like
Bilibili and Zhihu.
+
+Uniffle Tez has been jointly developed by HuoLaLa, Beike, and Shein.
+
+The development of many important features in the community has involved
contributions from well-known Chinese internet companies.
+For example, iQiyi has contributed support for accessing Kerberos HDFS
clusters and has optimized the performance of Spark AQE on Uniffle. Didi has
added support for multi-tenant job quotas. Netty data plane optimizations were
jointly completed by Didi and Vipshop. The support for Gluten was contributed
by Baidu and SF Express.
+
+Currently, the community has more than 50 contributors, with over 600 commits,
and has released four Apache versions. It is being used by dozens of companies.
Additionally, teams and companies interested in contributing to Uniffle Flink
can contact the Uniffle community at mailto:[email protected].
+
+Currently, there are no companies participating in the community with
deployment scenarios or development plans for Uniffle Flink. Your help in
filling this gap in the community would be greatly appreciated. Uniffle's
design incorporates a large number of mechanisms and strategies, and users are
welcome to contribute strategies that suit their own scenarios.
## Future Plans