This is an automated email from the ASF dual-hosted git repository.
roryqi pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-uniffle-website.git
The following commit(s) were added to refs/heads/master by this push:
new 9e9dcf2 Add new blog (#46)
9e9dcf2 is described below
commit 9e9dcf241b213ffce99d4d3830a47fd21191a84f
Author: roryqi <[email protected]>
AuthorDate: Sun Jul 23 11:04:12 2023 +0800
Add new blog (#46)
* Add new blog
* fix
---
blog/2023-07-21/current_state.md | 165 ++++++++++++++++++++++++++++++++++
blog/2023-07-21/img/aqe.png | Bin 0 -> 148308 bytes
blog/2023-07-21/img/architecture.png | Bin 0 -> 147255 bytes
blog/2023-07-21/img/data_read.png | Bin 0 -> 191060 bytes
blog/2023-07-21/img/get_results.png | Bin 0 -> 67869 bytes
blog/2023-07-21/img/gluten.png | Bin 0 -> 293542 bytes
blog/2023-07-21/img/hdfs_fallback.png | Bin 0 -> 51425 bytes
blog/2023-07-21/img/io_random.png | Bin 0 -> 50489 bytes
blog/2023-07-21/img/io_sort.png | Bin 0 -> 42263 bytes
blog/2023-07-21/img/metadata.png | Bin 0 -> 128684 bytes
blog/2023-07-21/img/process.png | Bin 0 -> 49584 bytes
blog/2023-07-21/img/select.png | Bin 0 -> 165995 bytes
12 files changed, 165 insertions(+)
diff --git a/blog/2023-07-21/current_state.md b/blog/2023-07-21/current_state.md
new file mode 100644
index 0000000..5d3c61b
--- /dev/null
+++ b/blog/2023-07-21/current_state.md
@@ -0,0 +1,165 @@
+<!--
+ ~ Licensed to the Apache Software Foundation (ASF) under one or more
+ ~ contributor license agreements. See the NOTICE file distributed with
+ ~ this work for additional information regarding copyright ownership.
+ ~ The ASF licenses this file to You under the Apache License, Version 2.0
+ ~ (the "License"); you may not use this file except in compliance with
+ ~ the License. You may obtain a copy of the License at
+ ~
+ ~ http://www.apache.org/licenses/LICENSE-2.0
+ ~
+ ~ Unless required by applicable law or agreed to in writing, software
+ ~ distributed under the License is distributed on an "AS IS" BASIS,
+ ~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ ~ See the License for the specific language governing permissions and
+ ~ limitations under the License.
+ -->
+
+# Uniffle: A New Chapter for Shuffle in the Cloud-Native Era
+
+## Background
+Shuffle is the process in distributed computing frameworks used to
redistribute data between upstream and downstream tasks. It is a crucial
component within computing frameworks and directly impacts their performance
and stability.
+However, as cloud-native architectures have been adopted, traditional
Shuffle solutions have revealed various issues.
+In a cloud-native architecture, techniques such as storage-compute separation
and mixed deployment are often applied together.
+The computational nodes have relatively small disk capacities, poor IO
performance, and an imbalance between CPU and IO resources.
+Additionally, computational nodes may be preempted by high-priority jobs due
to mixed deployments.
+In traditional Shuffle implementations, Shuffle nodes are tightly coupled with
computational nodes. However, due to the different resource requirements for
disk, memory, CPU, and node stability between computational and Shuffle nodes,
it is challenging to independently scale them based on their resource needs.
+By separating the computational nodes from Shuffle nodes, the computational
node's state becomes more lightweight after offloading the Shuffle state to
Shuffle nodes, reducing the need for job recomputation when computational nodes
are preempted. Decoupling computational and Shuffle nodes also reduces the
demand for disk specifications on computational nodes, enabling an increase in
the number of accessible computational nodes.
+In cloud-native architectures, large Shuffle jobs can exert significant
pressure on local disk drives, leading to issues such as insufficient disk
capacity on computational nodes and higher disk random IO, thus affecting the
performance and stability of large Shuffle jobs.
+The industry has explored various new Shuffle technologies, including Google's
BigQuery, Baidu DCE Shuffle, Facebook's Cosco Shuffle, Uber Zeus Shuffle,
Alibaba's Celeborn Shuffle, and many others.
+Each system has made its own trade-offs based on different scenarios. Uniffle
aims to create a fast, accurate, stable, and cost-efficient cloud-native Remote
Shuffle Service, considering performance, correctness, stability, and cost as
its core aspects.
+
+## Architecture
+
+### Coordinator
+The Coordinator is responsible for managing the entire cluster, and the
Shuffle Server reports the cluster's load situation to the Coordinator through
heartbeats. Based on the cluster's load, the Coordinator assigns suitable
Shuffle Servers for jobs. To facilitate operations and maintenance, the
Coordinator supports configuration deployment and provides a RESTful API for
external access.
+
+### Shuffle Server
+Shuffle Server is primarily responsible for receiving Shuffle data,
aggregating it, and writing it into storage. For Shuffle data stored in local
disks, Shuffle Server provides the ability to read the data.
+
+### Client
+The Client is responsible for communicating with the Coordinator and Shuffle
Server. It handles tasks such as requesting Shuffle Servers, sending
heartbeats, and performing read and write operations on Shuffle data. It
provides an SDK for Spark, MapReduce, and Tez to use.
+
+
+## Read & Write process
+
+1. The Driver obtains allocation information from the Coordinator.
+2. The Driver registers Shuffle information with the Shuffle Server.
+3. Based on the allocation information, the Executor sends Shuffle data to the
Shuffle Server in the form of Blocks.
+4. The Shuffle Server writes the data into storage.
+5. After the write task is completed, the Executor reports the result to the
Driver.
+6. The read task retrieves successful write task information from the Driver.
+7. The read task retrieves Shuffle metadata (such as all blockIds) from the
Shuffle Server.
+8. Based on the storage model, the read task reads Shuffle data from the
storage side.
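+The eight steps above can be sketched end to end. This is a hypothetical toy model, not the real Uniffle API: the class names, the round-robin assignment, and the method signatures are all illustrative.

```python
# Illustrative sketch of the Uniffle write/read flow; names are hypothetical.

class Coordinator:
    def __init__(self, servers):
        self.servers = servers

    def assign(self, num_partitions):
        # Toy round-robin assignment; the real Coordinator assigns by load.
        return {p: self.servers[p % len(self.servers)]
                for p in range(num_partitions)}

class ShuffleServer:
    def __init__(self, name):
        self.name = name
        self.blocks = {}  # partition -> list of (block_id, data)

    def register(self, shuffle_id):
        pass  # step 2: record shuffle metadata (no-op in this sketch)

    def send_block(self, partition, block_id, data):
        # steps 3-4: receive a block and "write" it into storage
        self.blocks.setdefault(partition, []).append((block_id, data))

    def block_ids(self, partition):
        # step 7: expose shuffle metadata (all blockIds) to readers
        return [bid for bid, _ in self.blocks.get(partition, [])]

    def read(self, partition):
        # step 8: read shuffle data back from storage
        return [d for _, d in self.blocks.get(partition, [])]

# steps 1-2: the driver obtains an assignment and registers the shuffle
servers = [ShuffleServer("s0"), ShuffleServer("s1")]
assignment = Coordinator(servers).assign(num_partitions=4)
assignment[1].register(shuffle_id=0)
# steps 3-4: an executor sends a block to the assigned server
assignment[1].send_block(partition=1, block_id=101, data=b"hello")
# steps 6-8: the read task fetches metadata, then the data itself
assert assignment[1].block_ids(1) == [101]
assert assignment[1].read(1) == [b"hello"]
```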
+
+## Performance
+
+### 1) Hybrid storage
+In our internal production environment, partition data blocks at the KB
level account for more than 80% of the total. To effectively address the
random IO caused by these small partitions, Uniffle incorporates the concept
of in-memory Shuffle, drawing on Google's Dremel. Additionally, since 80% of
the data capacity in production comes from large partitions, Uniffle
introduces local disk and HDFS as storage media to address the data capa [...]
+
+### 2) Random IO Optimization
+
+The essence of random IO is the existence of a large number of small data
block operations. In order to avoid these operations, Uniffle first aggregates
multiple MapTasks' identical partitions in the memory of the Shuffle Server to
generate larger partition data. When the Shuffle data in memory reaches the
partition threshold or the overall threshold, it is written into local or
remote storage.
+
+When the overall memory threshold is reached, the partition data in memory
is sorted by size, and the larger partitions are written to the storage media
first. Once the data in memory drops below a certain level, writing Shuffle
data to the storage media stops, further reducing random IO on the disk.
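+A minimal sketch of this flush policy, assuming hypothetical high/low watermark parameters (the real Uniffle thresholds are configuration-driven):

```python
def pick_flush_order(buffers, high_watermark, low_watermark):
    """Given partition -> buffered bytes, decide which partitions to flush.

    Flushing starts only when total buffered data exceeds the high
    watermark, drains the largest partitions first, and stops once the
    total falls below the low watermark -- a sketch of the policy above.
    """
    total = sum(buffers.values())
    if total < high_watermark:
        return []  # overall threshold not reached: keep aggregating
    to_flush = []
    for partition, size in sorted(buffers.items(), key=lambda kv: -kv[1]):
        if total <= low_watermark:
            break  # memory has dropped enough: stop writing to disk
        to_flush.append(partition)
        total -= size
    return to_flush

buffers = {"p0": 50, "p1": 200, "p2": 10, "p3": 120}
# With a 300-byte high watermark and 100-byte low watermark, the two
# largest partitions are flushed and the small ones stay in memory.
assert pick_flush_order(buffers, 300, 100) == ["p1", "p3"]
```

Flushing the largest partitions first means each flush writes the biggest possible sequential chunk, which is exactly what reduces random IO.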
+
+
+### 3) Storage media selection strategy
+
+For writing Shuffle data to local or remote storage, Uniffle has observed
through testing that larger data blocks yield better write performance for
remote storage. When the data block size exceeds 64MB, the write throughput
to remote storage can reach 32MB/s. Therefore, when writing data to storage
media, Uniffle writes larger data blocks to remote storage and smaller data
blocks to local storage, based on their size.
+
+### 4) Write concurrency
+For larger partitions, it is challenging to meet the performance requirements
of writing to remote storage with a single thread. In HDFS, a file can only be
written by one writer. To address this limitation, Uniffle allows multiple
files to be mapped to a single partition for remote storage. Uniffle utilizes
multi-threading to increase the write performance of large partitions. However,
it's important to note that a single partition occupying all remote storage
threads can affect the wri [...]
+
+### 5) Data Distribution
+
+For computational engines like Spark AQE (Adaptive Query Execution), there are
scenarios where a single task needs to read only a portion of a partition's
data, as well as scenarios where multiple partitions need to be read. In the
case of reading a portion of a partition's data, if the data is randomly
distributed, it can result in a significant amount of read amplification.
Performing data sorting and rewriting after the data is written can lead to
considerable performance loss. Theref [...]
+
+In scenarios where multiple partitions need to be read, Uniffle optimizes by
assigning the task of reading multiple partitions to a single Shuffle Server.
This allows RPC (Remote Procedure Call) requests to be aggregated, so that
one request to a Shuffle Server can fetch several partitions, minimizing
network overhead and improving overall performance. For more detailed
information, please refer to [4].
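+The RPC aggregation above amounts to grouping the requested partitions by the server that holds them, so one request per server replaces one request per partition. A small illustrative sketch:

```python
def batch_fetch_requests(partitions, assignment):
    """Group requested partitions by their Shuffle Server so a single RPC
    per server can fetch all of that server's partitions (sketch)."""
    by_server = {}
    for p in partitions:
        by_server.setdefault(assignment[p], []).append(p)
    return by_server

assignment = {0: "server-a", 1: "server-b", 2: "server-a", 3: "server-a"}
batches = batch_fetch_requests([0, 1, 2, 3], assignment)
# 2 RPCs instead of 4: server-a answers for partitions 0, 2, 3 at once.
assert batches == {"server-a": [0, 2, 3], "server-b": [1]}
```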
+
+### 6) Off-heap memory management
+Uniffle uses gRPC for data communication, and the gRPC implementation
involves multiple memory copies. Additionally, the Shuffle Server currently
manages data in heap memory. When running a Shuffle Server with 80GB of
memory in a production environment, it may experience a significant amount of
garbage collection (GC), with individual GC pauses lasting approximately 22
seconds. To address this issue, Uniffle upgraded the JDK to version 11. On
the data tra [...]
+
+### 7) Columnar Shuffle Format
+The Uniffle framework itself does not natively support columnar Shuffle. To
leverage the columnar Shuffle capabilities, Uniffle integrates with Gluten, a
columnar shuffle component. By integrating with Gluten, Uniffle is able to
reuse the columnar Shuffle capabilities provided by Gluten. For more detailed
information, please refer to [5].
+
+### 8) Remove barrier execution
+For batch processing in distributed computing frameworks, the commonly used
model is the Bulk Synchronous Parallel (BSP) model. In this model, downstream
tasks are only started after all upstream tasks have completed. However, to
reduce the impact of straggler nodes on job performance, some computing
frameworks support a "slow start" mechanism to allow upstream and downstream
tasks to run concurrently. On the other hand, for stream processing and OLAP
engines, a pipeline model is used wh [...]
+
+To cater to various computing frameworks, Uniffle employs a barrier-free
design that allows upstream and downstream stages to run concurrently. The key
to achieving this barrier-free execution lies in supporting efficient in-memory
read/write operations and an effective index filtering mechanism. With this
design, job execution does not require a request to the Shuffle Server for
writing all data to storage media at the end of each stage. Additionally, since
upstream and downstream stage [...]
+
+Uniffle has designed both bitmap index filtering and file index filtering
mechanisms to handle in-memory and storage media data respectively. This
enables Uniffle to efficiently support barrier-free execution and improve
performance by avoiding redundant data reads.
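+The index-filtering idea can be sketched as follows: the reader tracks already-processed blockIds (Uniffle uses a bitmap; a plain set stands in for it here) and, on each re-scan of a partition, reads only blocks it has not seen, so upstream and downstream can overlap without redundant reads. This is an illustrative model, not Uniffle's actual index format.

```python
def read_new_blocks(server_blocks, processed):
    """Return only blocks whose ids are not yet in the processed set,
    then mark them processed (set stands in for Uniffle's bitmap index)."""
    fresh = [(bid, data) for bid, data in server_blocks if bid not in processed]
    processed.update(bid for bid, _ in fresh)
    return fresh

processed = set()
first = read_new_blocks([(1, b"x"), (2, b"y")], processed)
# Upstream keeps writing while the downstream stage is already reading.
second = read_new_blocks([(1, b"x"), (2, b"y"), (3, b"z")], processed)
assert [bid for bid, _ in first] == [1, 2]
assert [bid for bid, _ in second] == [3]  # blocks 1 and 2 are filtered out
```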
+
+
+### Performance evaluation
+Benchmarks conducted with Uniffle 0.2 show that Uniffle's shuffle
performance is comparable to Spark's native shuffle for small data volumes,
while for large data volumes Uniffle outperforms Spark's native shuffle by up
to 30%. The benchmark results can be found at the following link:
https://github.com/apache/incubator-uniffle/blob/master/docs/benchmark.md
+
+## Correctness
+
+### Metadata Verification
+
+Spark reports information about all completed tasks to the driver. In the
first step of the reducer, the reducer retrieves a list of task unique
identifiers from the driver. Blocks are the data sent from mappers to the
shuffle server, and each block has a unique identifier. The data of a block is
stored in memory, on local disk, or in HDFS. To ensure data integrity, Uniffle
incorporates metadata verification. Uniffle designs index files for data files
stored on local disks and in HDFS. T [...]
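+One way to picture the metadata check described above: the reducer knows which tasks the driver reported as successful, and each block carries the id of the task that produced it, so any successful task with no readable blocks indicates missing data. A hypothetical sketch (the mapping and function names are illustrative):

```python
def find_missing_tasks(successful_task_ids, block_owner):
    """block_owner maps blockId -> producing taskId. Returns successful
    taskIds for which no block is readable (a metadata inconsistency)."""
    seen_tasks = set(block_owner.values())
    return set(successful_task_ids) - seen_tasks

owners = {101: "t1", 102: "t2", 103: "t1"}
assert find_missing_tasks(["t1", "t2"], owners) == set()        # consistent
assert find_missing_tasks(["t1", "t2", "t3"], owners) == {"t3"}  # t3's data lost
```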
+
+### Data Verification
+
+To address data corruption issues, Uniffle performs CRC (Cyclic Redundancy
Check) verification on data blocks. When reading data, Uniffle recalculates the
CRC and compares it with the CRC stored in the file to determine if the data is
corrupted. This helps prevent reading incorrect data.
+
+
+## Stability
+### 1) Fallback for Hybrid storage
+
+HDFS clusters may experience fluctuations in stability, which can result in
failures to write data to HDFS during certain time periods. To minimize the
impact of HDFS fluctuations, Uniffle has designed a fallback mechanism: when
writing to HDFS fails, the data is stored locally instead, reducing the
impact on the job.
+
+### 2) Flow Control
+Before sending a request, the job client will first request the memory
resources corresponding to the data. If there is insufficient memory, the job
will wait and stop sending data, thereby implementing flow control for the job.
+
+### 3) Replication
+Uniffle adopts the Quorum replica protocol, allowing jobs to configure the
number of replicas for their data writes based on their own needs. This helps
prevent stability issues caused by having only a single replica for the job.
+
+### 4) Stage Recomputation
+Currently, Spark supports recomputing the entire stage if there is a failure
in reading from the Shuffle Server, helping the job recover and resume its
execution.
+
+### 5) Quota Management
+When the cluster capacity reaches its limit or a job reaches the user's
quota limit, the Coordinator can make the job fall back to native Spark
shuffle. This prevents excessive cluster pressure and protects cluster
stability from a single user erroneously submitting a large number of jobs.
+
+### 6) Coordinator HA
+Uniffle does not rely on systems like ZooKeeper, etcd, or Raft for high
availability (HA), mainly to avoid the complexity these consensus systems
introduce. The Coordinator is stateless and does not persist any state: all
state information is reported by the Shuffle Servers through heartbeats, so
there is no need to elect a master node. Deploying multiple Coordinator
instances ensures high availability of the service.
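+Because any Coordinator instance can serve a request, client-side HA reduces to simple failover across the configured addresses, as in this hypothetical sketch (the address names and request function are illustrative):

```python
def ask_any_coordinator(addresses, request_fn):
    """Try each Coordinator address in turn; since instances are stateless
    and interchangeable, the first one that responds is sufficient."""
    last_error = None
    for addr in addresses:
        try:
            return request_fn(addr)
        except ConnectionError as e:
            last_error = e  # this instance is down, try the next one
    raise last_error

def request(addr):
    if addr == "coord-1":
        raise ConnectionError("coord-1 unreachable")
    return f"assignment-from-{addr}"

# coord-1 is down; the client transparently gets its answer from coord-2.
assert ask_any_coordinator(["coord-1", "coord-2"], request) == "assignment-from-coord-2"
```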
+
+## Cost Effective
+### 1) Low-Cost Remote Storage
+In general, for a relatively stable business, computational resources tend to
remain stable while storage resources grow linearly. These storage resources
store a large amount of cold data. Uniffle supports hybrid storage, which
allows the utilization of these unused storage resources, thereby reducing the
overall system cost.
+
+### 2) Automatic Scaling
+Uniffle has developed a K8S Operator that implements scaling operations for
stateful services using webhooks. By leveraging Horizontal Pod Autoscaler
(HPA), automatic scaling can be achieved, further reducing system costs.
+
+## Community Engagement
+Currently, Uniffle supports multiple computational frameworks such as Spark,
MapReduce, and Tez. Uniffle Spark has been adopted by companies like Tencent,
Didi, iQiyi, SF Express, and Vipshop, handling PB-level data on a daily basis.
Uniffle MapReduce is employed in mixed deployment scenarios by companies like
Bilibili and Zhihu. Uniffle Tez has been jointly developed by HuoLaLa, Beike,
and Shein, and is expected to be officially deployed in Q3 2023. The
development of many important fea [...]
+
+
+## Future Plans
+### Storage Optimization
+1. Integration with object storage to optimize system costs.
+2. Merging index files and data files to further reduce IO overhead.
+3. Support for heterogeneous storage resources such as SSD and HDD.
+4. Support for sorting data by key.
+
+### Computation Optimization
+1. Support for dynamic allocation of Shuffle Servers.
+2. Partial support for Slow Start feature in some engines.
+3. Continuous optimizations for Spark AQE.
+4. Support for the Flink engine.
+5. Asynchronous data reading support for compute engines.
+
+## Summary
+Uniffle has been designed with a focus on performance, correctness, stability,
and cost-effectiveness, making it a suitable Shuffle system for cloud-native
architectures. We welcome everyone to contribute to the Uniffle project. The
Uniffle project can be found at https://github.com/apache/incubator-uniffle.
+
+## Reference
+[1] https://cloud.tencent.com/developer/article/1903023
+
+[2] https://cloud.tencent.com/developer/article/1943179
+
+[3] https://github.com/apache/incubator-uniffle/pull/137
+
+[4] https://github.com/apache/incubator-uniffle/pull/307
+
+[5] https://github.com/apache/incubator-uniffle/pull/950
diff --git a/blog/2023-07-21/img/aqe.png b/blog/2023-07-21/img/aqe.png
new file mode 100644
index 0000000..bc96bdf
Binary files /dev/null and b/blog/2023-07-21/img/aqe.png differ
diff --git a/blog/2023-07-21/img/architecture.png
b/blog/2023-07-21/img/architecture.png
new file mode 100644
index 0000000..e3047c5
Binary files /dev/null and b/blog/2023-07-21/img/architecture.png differ
diff --git a/blog/2023-07-21/img/data_read.png
b/blog/2023-07-21/img/data_read.png
new file mode 100644
index 0000000..ecfffc4
Binary files /dev/null and b/blog/2023-07-21/img/data_read.png differ
diff --git a/blog/2023-07-21/img/get_results.png
b/blog/2023-07-21/img/get_results.png
new file mode 100644
index 0000000..220f0b6
Binary files /dev/null and b/blog/2023-07-21/img/get_results.png differ
diff --git a/blog/2023-07-21/img/gluten.png b/blog/2023-07-21/img/gluten.png
new file mode 100644
index 0000000..09b1c20
Binary files /dev/null and b/blog/2023-07-21/img/gluten.png differ
diff --git a/blog/2023-07-21/img/hdfs_fallback.png
b/blog/2023-07-21/img/hdfs_fallback.png
new file mode 100644
index 0000000..2122fc3
Binary files /dev/null and b/blog/2023-07-21/img/hdfs_fallback.png differ
diff --git a/blog/2023-07-21/img/io_random.png
b/blog/2023-07-21/img/io_random.png
new file mode 100644
index 0000000..753abe4
Binary files /dev/null and b/blog/2023-07-21/img/io_random.png differ
diff --git a/blog/2023-07-21/img/io_sort.png b/blog/2023-07-21/img/io_sort.png
new file mode 100644
index 0000000..208d6cc
Binary files /dev/null and b/blog/2023-07-21/img/io_sort.png differ
diff --git a/blog/2023-07-21/img/metadata.png b/blog/2023-07-21/img/metadata.png
new file mode 100644
index 0000000..4935a1c
Binary files /dev/null and b/blog/2023-07-21/img/metadata.png differ
diff --git a/blog/2023-07-21/img/process.png b/blog/2023-07-21/img/process.png
new file mode 100644
index 0000000..df70a3d
Binary files /dev/null and b/blog/2023-07-21/img/process.png differ
diff --git a/blog/2023-07-21/img/select.png b/blog/2023-07-21/img/select.png
new file mode 100644
index 0000000..93a4350
Binary files /dev/null and b/blog/2023-07-21/img/select.png differ