This is an automated email from the ASF dual-hosted git repository.
weitingchen pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/incubator-gluten-site.git
The following commit(s) were added to refs/heads/main by this push:
new 8528bbd update index.md for chart update and latest performance
result (#23)
8528bbd is described below
commit 8528bbdeaa5a93774f85d33de3c4eff76a0b7fcc
Author: Wei-Ting Chen <[email protected]>
AuthorDate: Tue Sep 10 22:17:40 2024 +0800
update index.md for chart update and latest performance result (#23)
---
assets/images/gluten_framework.png | Bin 0 -> 84306 bytes
assets/images/spark_operator_performance.png | Bin 0 -> 33887 bytes
.../velox_tpcds-like_sf3000_time_comparison.png | Bin 0 -> 38519 bytes
.../velox_tpcds-like_sf3000_top20_speedup.png | Bin 0 -> 87878 bytes
assets/images/velox_tpch-like_sf3000_speedup.png | Bin 0 -> 72553 bytes
.../velox_tpch-like_sf3000_time_comparison.png | Bin 0 -> 37962 bytes
contact-us.md | 1 +
contributing.md | 1 +
index.md | 122 ++++++++++++++++++++-
release.md | 3 +-
10 files changed, 124 insertions(+), 3 deletions(-)
diff --git a/assets/images/gluten_framework.png
b/assets/images/gluten_framework.png
new file mode 100644
index 0000000..6d14ff2
Binary files /dev/null and b/assets/images/gluten_framework.png differ
diff --git a/assets/images/spark_operator_performance.png
b/assets/images/spark_operator_performance.png
new file mode 100644
index 0000000..7970abb
Binary files /dev/null and b/assets/images/spark_operator_performance.png differ
diff --git a/assets/images/velox_tpcds-like_sf3000_time_comparison.png
b/assets/images/velox_tpcds-like_sf3000_time_comparison.png
new file mode 100644
index 0000000..b92208a
Binary files /dev/null and
b/assets/images/velox_tpcds-like_sf3000_time_comparison.png differ
diff --git a/assets/images/velox_tpcds-like_sf3000_top20_speedup.png
b/assets/images/velox_tpcds-like_sf3000_top20_speedup.png
new file mode 100644
index 0000000..553b417
Binary files /dev/null and
b/assets/images/velox_tpcds-like_sf3000_top20_speedup.png differ
diff --git a/assets/images/velox_tpch-like_sf3000_speedup.png
b/assets/images/velox_tpch-like_sf3000_speedup.png
new file mode 100644
index 0000000..7e444b0
Binary files /dev/null and b/assets/images/velox_tpch-like_sf3000_speedup.png
differ
diff --git a/assets/images/velox_tpch-like_sf3000_time_comparison.png
b/assets/images/velox_tpch-like_sf3000_time_comparison.png
new file mode 100644
index 0000000..29c37c9
Binary files /dev/null and
b/assets/images/velox_tpch-like_sf3000_time_comparison.png differ
diff --git a/contact-us.md b/contact-us.md
index 63f9bcb..1ff2b4f 100644
--- a/contact-us.md
+++ b/contact-us.md
@@ -2,6 +2,7 @@
layout: page
title: Contact Us
nav_order: 9
+permalink: /contact/
---
# Contact Us
diff --git a/contributing.md b/contributing.md
index 24feb44..67c62ac 100644
--- a/contributing.md
+++ b/contributing.md
@@ -2,6 +2,7 @@
layout: page
title: Contributing to Gluten
nav_order: 7
+permalink: /contributing/
---
# How to become a committer
diff --git a/index.md b/index.md
index fddc91f..b8480e1 100644
--- a/index.md
+++ b/index.md
@@ -15,7 +15,7 @@ Apache Gluten(incubating) is a middle layer responsible for
offloading JVM-based
Apache Spark is a stable, mature project that has been developed for many
years. It is one of the best frameworks to scale out for processing
petabyte-scale datasets. However, the Spark community has had to address
performance challenges that require various optimizations over time. As a key
optimization in Spark 2.0, Whole Stage Code Generation is introduced to replace
Volcano Model, which achieves 2x speedup. Henceforth, most optimizations are at
query plan level. Single operator's per [...]
<p align="center">
-<img
src="https://user-images.githubusercontent.com/47296334/199853029-b6d0ea19-f8e4-4f62-9562-2838f7f159a7.png"
width="800">
+<img src="/assets/images/spark_operator_performance.png" width="800">
</p>
On the other side, SQL engines have been researched for many years. There are
a few libraries like Clickhouse, Arrow and Velox, etc. By using features like
native implementation, columnar data format and vectorized data processing,
these libraries can outperform Spark's JVM based SQL engine. However, these
libraries only support single node execution.
@@ -48,7 +48,7 @@ You can click below links for more related information.
The overview chart is like below. Substrait provides a well-defined
cross-language specification for data compute operations (see more details
[here](https://substrait.io/)). Spark physical plan is transformed to Substrait
plan. Then Substrait plan is passed to native through JNI call.
On native side, the native operator chain will be built out and offloaded to
native engine. Gluten will return Columnar Batch to Spark and Spark Columnar
API (since Spark-3.0) will be used at execution time. Gluten uses Apache Arrow
data format as its basic data format, so the returned data to Spark JVM is
ArrowColumnarBatch.
<p align="center">
-<img
src="https://user-images.githubusercontent.com/47296334/199617207-1140698a-4d53-462d-9bc7-303d14be060b.png"
width="800">
+<img src="/assets/images/gluten_framework.png" width="800">
</p>
Currently, Gluten only supports Clickhouse backend & Velox backend. Velox is a
C++ database acceleration library which provides reusable, extensible and
high-performance data processing components. More details can be found from
https://github.com/facebookincubator/velox/. Gluten can also be extended to
support more backends.
@@ -59,3 +59,121 @@ There are several key components in Gluten:
* **Fallback Mechanism**: supports falling back to Vanilla spark for
unsupported operators. Gluten ColumnarToRow (C2R) and RowToColumnar (R2C) will
convert Gluten columnar data and Spark's internal row data if needed. Both C2R
and R2C are implemented in native code as well
* **Metrics**: collected from Gluten native engine to help identify bugs,
performance bottlenecks, etc. The metrics are displayed in Spark UI.
* **Shim Layer**: supports multiple Spark versions. We plan to only support
Spark's latest 2 or 3 releases. Currently, Spark-3.2, Spark-3.3 & Spark-3.4
(experimental) are supported.
+
+# 3 How to Use
+
+There are two methods for utilizing Gluten. The first is to use a pre-built
JAR for a quick test of Gluten's acceleration capabilities. The second is to
compile the JAR yourself, ensuring it runs on your target platform and delivers
the best possible performance.
+
+# 3.1 Use a Pre-built JAR
+
+One way is to use a released JAR. Here is a simple example. Currently, only
CentOS 7/8 and Ubuntu 20.04/22.04 are well supported.
+You can find a pre-built Gluten release JAR at [Apache Gluten
Download](https://downloads.apache.org/incubator/gluten/).
+Please be aware that the pre-built JAR is compiled using static linking via
vcpkg, based on the Intel® Xeon® Gold 6252 processor.
+It should be compatible with most operating systems, including CentOS and
Ubuntu, although performance is not guaranteed.
+For the best performance, we recommend following section 3.2 (Custom Build)
to compile Gluten from source for your specific platform.
+
+```
+spark-shell \
+ --master yarn --deploy-mode client \
+ --conf spark.plugins=org.apache.gluten.GlutenPlugin \
+ --conf spark.memory.offHeap.enabled=true \
+ --conf spark.memory.offHeap.size=20g \
+ --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager \
+ --jars gluten-velox-bundle-spark3.2_2.12-centos_7_x86_64-1.2.0.jar
+```
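
The JAR file name in the example above appears to encode the backend, Spark version, Scala version, OS, architecture and Gluten release. The following is a minimal sketch, assuming that pattern holds for other combinations. The pattern is inferred from this single file name only, and the helper name `gluten_jar_name` is hypothetical. It assembles the name so the `--jars` argument can be compared with what is actually on the download page.

```shell
# Hypothetical helper: compose a pre-built JAR name from its parts.
# The pattern is inferred from the one file name shown above, not from
# official naming documentation; verify it against the download page.
gluten_jar_name() {
  spark_ver="$1"; scala_ver="$2"; os="$3"; arch="$4"; gluten_ver="$5"
  printf 'gluten-velox-bundle-spark%s_%s-%s_%s-%s.jar\n' \
    "$spark_ver" "$scala_ver" "$os" "$arch" "$gluten_ver"
}

# Reproduces the file name used in the example above.
gluten_jar_name 3.2 2.12 centos_7 x86_64 1.2.0
```

If the name printed for your platform does not exist among the released files, the custom build in section 3.2 is the safer route.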
+
+# 3.2 Custom Build
+
+To utilize Gluten with Spark, you can compile Gluten from the source and then
configure it to enable the Gluten plugin. Below is a straightforward example.
For more comprehensive instructions, please refer to the detailed guidance
provided in the corresponding backend section.
+
+```
+export gluten_jar=/PATH/TO/GLUTEN/backends-velox/target/<gluten-jar>
+spark-shell \
+ --master yarn --deploy-mode client \
+ --conf spark.plugins=org.apache.gluten.GlutenPlugin \
+ --conf spark.memory.offHeap.enabled=true \
+ --conf spark.memory.offHeap.size=20g \
+ --conf spark.driver.extraClassPath=${gluten_jar} \
+ --conf spark.executor.extraClassPath=${gluten_jar} \
+ --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager \
+ ...
+```
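
Because `gluten_jar` is passed to both the driver and executor class paths, a wrong path tends to surface only after the session starts. The following is a small sketch of a pre-flight check, assuming the JAR is a regular file on the local file system. The function name `require_jar` is hypothetical and is not part of Gluten or Spark.

```shell
# Hypothetical pre-flight check: fail early if the JAR path is wrong.
require_jar() {
  if [ ! -f "$1" ]; then
    echo "Gluten jar not found: $1" >&2
    return 1
  fi
  echo "Using Gluten jar: $1"
}
```

A possible use is to guard the launch with `require_jar "$gluten_jar" && spark-shell ...`, keeping the same options as in the example above.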
+
+## 3.2.1 Build and install Gluten with Velox backend
+
+If you want to use the Gluten **Velox** backend, see [Build with
Velox](https://gluten.apache.org/docs/getting-started/velox-backend/) to build
and install the necessary libraries.
+You can also find more information in [Gluten with Velox
backend](https://gluten.apache.org/docs/velox-backend/) and
[velox](https://github.com/facebookincubator/velox).
+
+## 3.2.2 Build and install Gluten with ClickHouse backend
+
+If you want to use the Gluten **ClickHouse** backend, see [Build with ClickHouse
Backend](http://gluten.apache.org/docs/getting-started/clickhouse-backend/).
+The ClickHouse backend is developed by [Kyligence](https://kyligence.io/). Please
visit [Kyligence's ClickHouse](https://github.com/Kyligence/ClickHouse) for
more information.
+
+## 3.2.3 Build options
+
+See [Build
Parameters](https://gluten.apache.org/docs/getting-started/build-guide/).
+
+
+# 4 Join the Community
+
+## Contact Us
+
+Gluten was initiated by Intel and Kyligence in 2022. Several other companies
also actively participate in its development, including BIGO, Meituan, Alibaba
Cloud, NetEase, Baidu and Microsoft. If you are interested in the Gluten project,
please subscribe to the mailing lists below for further discussion.
+
+Please see [contact us](https://gluten.apache.org/contact/) for more
information.
+
+## Source Code
+
+Please see [gluten source code](https://github.com/apache/incubator-gluten)
for more information.
+
+## How to Contribute to Gluten
+
+Please see [contributing guide](https://gluten.apache.org/contributing/) about
how to make contributions.
+
+## WeChat Group
+
+There is a WeChat group (in Chinese) which may be more friendly for PRC
developers/users. Due to the limitations of WeChat groups, please send an email to
[email protected].
+
+## Slack channel
+
+For the Velox backend, we recommend using the [Velox
community](https://github.com/facebookincubator/velox?tab=readme-ov-file#community).
+
+
+# 5 Performance
+
+We use TPCH-like and TPCDS-like decision support benchmarks to evaluate
Gluten's performance.
+TPCH-like is a query set modified from the [TPC-H
benchmark](http://tpc.org/tpch/default5.asp) and TPCDS-like is a query set
modified from the [TPC-DS benchmark](https://tpc.org/tpcds/default5.asp). We use
the Parquet file format for Velox testing and the MergeTree file format for
ClickHouse testing, with the Parquet file format as the baseline. See [Decision Support
Benchmark](https://github.com/apache/incubator-gluten/tree/main/tools/workload).
+
+## Gluten + Velox backend Performance
+
+Test environment: a single node with 3TB of data on Intel® Xeon® Platinum
8592+; Spark-3.3.1 for both baseline and Gluten.
+
+The TPCH-like result (tested in Mar. 2024) shows an overall speedup of 3.34x
and up to 23.45x speedup in a single query with Gluten + Velox backend used.
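
The text does not state how the overall and per-query speedups are computed. The following sketch assumes the overall figure is total baseline time divided by total Gluten time, and that the per-query figure is the largest baseline-to-Gluten ratio. The timings are invented for illustration and are not measured results, and `speedup_summary` is a hypothetical name.

```shell
# Illustrative timings only (seconds); these are NOT measured results.
# Columns: query, baseline time, Gluten time.
timings='q1 100 25
q2 80 40
q3 60 15'

# Assumed definitions: overall = sum(baseline) / sum(gluten);
# per-query = baseline / gluten, reporting the largest ratio.
speedup_summary() {
  awk '{ b += $2; g += $3; s = $2 / $3; if (s > max) max = s }
       END { printf "overall %.2fx, max %.2fx\n", b / g, max }'
}

printf '%s\n' "$timings" | speedup_summary
```

Other definitions, such as a geometric mean of per-query ratios, would give different overall numbers for the same timings.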
+
+
+
+
+
+The TPCDS-like result (tested in Mar. 2024) shows an overall speedup of 3.02x
and up to 13.75x speedup in a single query with Gluten + Velox backend used.
+
+
+
+
+
+Notices & Disclaimers
+
+Performance varies by use, configuration and other factors. Learn more on the
Performance Index site. Performance results are based on testing as of dates
shown in configurations and may not reflect all publicly available updates.
See backup for configuration details. No product or component can be
absolutely secure. Your costs and results may vary. Intel technologies may
require enabled hardware, software or service activation. © Intel Corporation.
Intel, the Intel logo, and other In [...]
+
+## Gluten + ClickHouse backend Performance
+
+Test environment: an 8-node AWS cluster with 1TB of data;
Spark-3.1.1 for both baseline and Gluten. The Decision Support Benchmark1
result shows an average speedup of 2.12x and up to 3.48x speedup with the Gluten
ClickHouse backend.
+
+
+
+
+# Thanks to our contributors
+
+<a href="https://github.com/apache/incubator-gluten/graphs/contributors">
+ <img
src="https://contrib.rocks/image?repo=apache/incubator-gluten&columns=25" />
+</a>
+
diff --git a/release.md b/release.md
index af8f01b..42c18ba 100644
--- a/release.md
+++ b/release.md
@@ -1,7 +1,8 @@
---
layout: page
-title: Gluten Release
+title: Gluten Releases
nav_order: 5
+permalink: /releases/
---
[Gluten](https://github.com/apache/incubator-gluten) is a plugin for Apache
Spark to double SparkSQL's performance.