(incubator-gluten-site) branch main updated: update index.md for chart update and latest performance result (#23)

weitingchen Tue, 10 Sep 2024 07:19:57 -0700

This is an automated email from the ASF dual-hosted git repository.

weitingchen pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/incubator-gluten-site.git



The following commit(s) were added to refs/heads/main by this push:
     new 8528bbd  update index.md for chart update and latest performance result (#23)
8528bbd is described below

commit 8528bbdeaa5a93774f85d33de3c4eff76a0b7fcc
Author: Wei-Ting Chen <[email protected]>
AuthorDate: Tue Sep 10 22:17:40 2024 +0800

    update index.md for chart update and latest performance result (#23)
---
 assets/images/gluten_framework.png                 | Bin 0 -> 84306 bytes
 assets/images/spark_operator_performance.png       | Bin 0 -> 33887 bytes
 .../velox_tpcds-like_sf3000_time_comparison.png    | Bin 0 -> 38519 bytes
 .../velox_tpcds-like_sf3000_top20_speedup.png      | Bin 0 -> 87878 bytes
 assets/images/velox_tpch-like_sf3000_speedup.png   | Bin 0 -> 72553 bytes
 .../velox_tpch-like_sf3000_time_comparison.png     | Bin 0 -> 37962 bytes
 contact-us.md                                      |   1 +
 contributing.md                                    |   1 +
 index.md                                           | 122 ++++++++++++++++++++-
 release.md                                         |   3 +-
 10 files changed, 124 insertions(+), 3 deletions(-)

diff --git a/assets/images/gluten_framework.png b/assets/images/gluten_framework.png
new file mode 100644
index 0000000..6d14ff2
Binary files /dev/null and b/assets/images/gluten_framework.png differ
diff --git a/assets/images/spark_operator_performance.png b/assets/images/spark_operator_performance.png
new file mode 100644
index 0000000..7970abb
Binary files /dev/null and b/assets/images/spark_operator_performance.png differ
diff --git a/assets/images/velox_tpcds-like_sf3000_time_comparison.png b/assets/images/velox_tpcds-like_sf3000_time_comparison.png
new file mode 100644
index 0000000..b92208a
Binary files /dev/null and b/assets/images/velox_tpcds-like_sf3000_time_comparison.png differ
diff --git a/assets/images/velox_tpcds-like_sf3000_top20_speedup.png b/assets/images/velox_tpcds-like_sf3000_top20_speedup.png
new file mode 100644
index 0000000..553b417
Binary files /dev/null and b/assets/images/velox_tpcds-like_sf3000_top20_speedup.png differ
diff --git a/assets/images/velox_tpch-like_sf3000_speedup.png b/assets/images/velox_tpch-like_sf3000_speedup.png
new file mode 100644
index 0000000..7e444b0
Binary files /dev/null and b/assets/images/velox_tpch-like_sf3000_speedup.png differ
diff --git a/assets/images/velox_tpch-like_sf3000_time_comparison.png b/assets/images/velox_tpch-like_sf3000_time_comparison.png
new file mode 100644
index 0000000..29c37c9
Binary files /dev/null and b/assets/images/velox_tpch-like_sf3000_time_comparison.png differ
diff --git a/contact-us.md b/contact-us.md
index 63f9bcb..1ff2b4f 100644
--- a/contact-us.md
+++ b/contact-us.md
@@ -2,6 +2,7 @@
 layout: page
 title: Contact Us
 nav_order: 9
+permalink: /contact/
 ---
 # Contact Us
 
diff --git a/contributing.md b/contributing.md
index 24feb44..67c62ac 100644
--- a/contributing.md
+++ b/contributing.md
@@ -2,6 +2,7 @@
 layout: page
 title: Contributing to Gluten
 nav_order: 7
+permalink: /contributing/
 ---
 
 # How to become a committer
diff --git a/index.md b/index.md
index fddc91f..b8480e1 100644
--- a/index.md
+++ b/index.md
@@ -15,7 +15,7 @@ Apache Gluten(incubating) is a middle layer responsible for 
offloading JVM-based
 Apache Spark is a stable, mature project that has been developed for many 
years. It is one of the best frameworks to scale out for processing 
petabyte-scale datasets. However, the Spark community has had to address 
performance challenges that require various optimizations over time. As a key 
optimization in Spark 2.0, Whole Stage Code Generation is introduced to replace 
Volcano Model, which achieves 2x speedup. Henceforth, most optimizations are at 
query plan level. Single operator's per [...]
 
 <p align="center">
-<img src="https://user-images.githubusercontent.com/47296334/199853029-b6d0ea19-f8e4-4f62-9562-2838f7f159a7.png" width="800">
+<img src="/assets/images/spark_operator_performance.png" width="800">
 </p>
 
 On the other side, SQL engines have been researched for many years. There are 
a few libraries like Clickhouse, Arrow and Velox, etc. By using features like 
native implementation, columnar data format and vectorized data processing, 
these libraries can outperform Spark's JVM based SQL engine. However, these 
libraries only support single node execution.
@@ -48,7 +48,7 @@ You can click below links for more related information.
 The overview chart is like below. Substrait provides a well-defined 
cross-language specification for data compute operations (see more details 
[here](https://substrait.io/)). Spark physical plan is transformed to Substrait 
plan. Then Substrait plan is passed to native through JNI call.
 On native side, the native operator chain will be built out and offloaded to 
native engine. Gluten will return Columnar Batch to Spark and Spark Columnar 
API (since Spark-3.0) will be used at execution time. Gluten uses Apache Arrow 
data format as its basic data format, so the returned data to Spark JVM is 
ArrowColumnarBatch.
 <p align="center">
-<img src="https://user-images.githubusercontent.com/47296334/199617207-1140698a-4d53-462d-9bc7-303d14be060b.png" width="800">
+<img src="/assets/images/gluten_framework.png" width="800">
 </p>
 Currently, Gluten only supports Clickhouse backend & Velox backend. Velox is a 
C++ database acceleration library which provides reusable, extensible and 
high-performance data processing components. More details can be found from 
https://github.com/facebookincubator/velox/. Gluten can also be extended to 
support more backends.
 
@@ -59,3 +59,121 @@ There are several key components in Gluten:
 * **Fallback Mechanism**: supports falling back to Vanilla spark for 
unsupported operators. Gluten ColumnarToRow (C2R) and RowToColumnar (R2C) will 
convert Gluten columnar data and Spark's internal row data if needed. Both C2R 
and R2C are implemented in native code as well
 * **Metrics**: collected from Gluten native engine to help identify bugs, 
performance bottlenecks, etc. The metrics are displayed in Spark UI.
 * **Shim Layer**: supports multiple Spark versions. We plan to only support 
Spark's latest 2 or 3 releases. Currently, Spark-3.2, Spark-3.3 & Spark-3.4 
(experimental) are supported.
+
+# 3 How to Use
+
+There are two methods for utilizing Gluten. The first is to use a pre-built 
JAR for a quick test of Gluten's acceleration capabilities. The second is to 
compile the JAR yourself, ensuring it runs on your target platform and delivers 
the best possible performance.
+
+# 3.1 Use a pre-Built Jar
+
+One way is to use a released JAR. Here is a simple example. Currently, only CentOS 7/8 and Ubuntu 20.04/22.04 are well supported.
+You can find a pre-built Gluten release JAR at [Apache Gluten Download](https://downloads.apache.org/incubator/gluten/).
+Please be aware that the pre-built JAR is compiled using static linking via vcpkg, based on the Intel® Xeon® Gold 6252 processor.
+It should be compatible with most operating systems, including CentOS and Ubuntu, although performance is not guaranteed.
+For optimal performance, we recommend following section 3.2, Custom Build, to compile Gluten from source for your specific platform.
+
+```
+spark-shell \
+ --master yarn --deploy-mode client \
+ --conf spark.plugins=org.apache.gluten.GlutenPlugin \
+ --conf spark.memory.offHeap.enabled=true \
+ --conf spark.memory.offHeap.size=20g \
+ --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager \
+ --jars gluten-velox-bundle-spark3.2_2.12-centos_7_x86_64-1.2.0.jar
+```
+
+# 3.2 Custom Build
+
+To utilize Gluten with Spark, you can compile Gluten from the source and then 
configure it to enable the Gluten plugin. Below is a straightforward example. 
For more comprehensive instructions, please refer to the detailed guidance 
provided in the corresponding backend section.
+
+```
+export gluten_jar="/PATH/TO/GLUTEN/backends-velox/target/<gluten-jar>"
+spark-shell \
+  --master yarn --deploy-mode client \
+  --conf spark.plugins=org.apache.gluten.GlutenPlugin \
+  --conf spark.memory.offHeap.enabled=true \
+  --conf spark.memory.offHeap.size=20g \
+  --conf spark.driver.extraClassPath=${gluten_jar} \
+  --conf spark.executor.extraClassPath=${gluten_jar} \
+  --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager
+  ...
+```
+
+## 3.2.1 Build and install Gluten with Velox backend
+
+If you want to use Gluten **Velox** backend, see [Build with 
Velox](https://gluten.apache.org/docs/getting-started/velox-backend/) to build 
and install the necessary libraries.
+You can also find more information in [Gluten with Velox 
backend](https://gluten.apache.org/docs/velox-backend/) and 
[velox](https://github.com/facebookincubator/velox).
+
+## 3.2.2 Build and install Gluten with ClickHouse backend
+
+If you want to use Gluten **ClickHouse** backend, see [Build with ClickHouse 
Backend](http://gluten.apache.org/docs/getting-started/clickhouse-backend/).
+The ClickHouse backend is developed by [Kyligence](https://kyligence.io/); please visit [Kyligence's ClickHouse](https://github.com/Kyligence/ClickHouse) for more information.
+
+## 3.2.3 Build options
+
+See [Build 
Parameters](https://gluten.apache.org/docs/getting-started/build-guide/).
+
+
+# 4 Join the Community
+
+## Contact Us
+
+Gluten was initiated by Intel and Kyligence in 2022. Several other companies also actively participate in its development, including BIGO, Meituan, Alibaba Cloud, NetEase, Baidu, and Microsoft. If you are interested in the Gluten project, please subscribe to the mailing lists for further discussion.
+
+Please see [contact us](https://gluten.apache.org/contact/) for more 
information.
+
+## Source Code
+
+Please see [gluten source code](https://github.com/apache/incubator-gluten) 
for more information.
+
+## How to Contribute to Gluten
+
+Please see [contributing guide](https://gluten.apache.org/contributing/) about 
how to make contributions.
+
+## WeChat Group
+
+There is a WeChat group (in Chinese), which may be more convenient for PRC developers/users. Because of WeChat group limitations, please mail [email protected] to join.
+
+## Slack channel
+
+For the Velox backend, we recommend using the [velox community](https://github.com/facebookincubator/velox?tab=readme-ov-file#community).
+
+
+# 5 Performance
+
+We use TPCH-like and TPCDS-like queries as decision support benchmarks to evaluate Gluten's performance.
+TPCH-like is a query set modified from the [TPC-H benchmark](http://tpc.org/tpch/default5.asp), and TPCDS-like is a query set modified from the [TPC-DS benchmark](https://tpc.org/tpcds/default5.asp). We use the Parquet file format for Velox testing and the MergeTree file format for ClickHouse testing, with the Parquet file format as the baseline. See [Decision Support Benchmark](https://github.com/apache/incubator-gluten/tree/main/tools/workload).
+
+## Gluten + Velox backend Performance
+
+Test environment: a single node with 3 TB of data on Intel® Xeon® Platinum 8592+; Spark-3.3.1 for both baseline and Gluten.
+
+The TPCH-like result (tested in Mar. 2024) shows an overall speedup of 3.34x 
and up to 23.45x speedup in a single query with Gluten + Velox backend used.
+
+![Performance](/assets/images/velox_tpch-like_sf3000_time_comparison.png)
+
+![Performance](/assets/images/velox_tpch-like_sf3000_speedup.png)
+
+The TPCDS-like result (tested in Mar. 2024) shows an overall speedup of 3.02x 
and up to 13.75x speedup in a single query with Gluten + Velox backend used.
+
+![Performance](/assets/images/velox_tpcds-like_sf3000_time_comparison.png)
+
+![Performance](/assets/images/velox_tpcds-like_sf3000_top20_speedup.png)
+
+Notices & Disclaimers
+
+Performance varies by use, configuration and other factors. Learn more on the 
Performance Index site. Performance results are based on testing as of dates 
shown in configurations and may not reflect all publicly available updates.  
See backup for configuration details.  No product or component can be 
absolutely secure. Your costs and results may vary. Intel technologies may 
require enabled hardware, software or service activation. © Intel Corporation.  
Intel, the Intel logo, and other In [...]
+
+## Gluten + ClickHouse backend Performance
+
+Test environment: an 8-node AWS cluster with 1 TB of data; Spark-3.1.1 for both baseline and Gluten. The Decision Support Benchmark1 result shows an average speedup of 2.12x and up to 3.48x speedup with the Gluten ClickHouse backend.
+
+![Performance](/assets/images/clickhouse_decision_support_bench1_22queries_performance.png)
+
+
+# Thanks to our contributors
+
+<a href="https://github.com/apache/incubator-gluten/graphs/contributors">
+  <img src="https://contrib.rocks/image?repo=apache/incubator-gluten&columns=25" />
+</a>
+
diff --git a/release.md b/release.md
index af8f01b..42c18ba 100644
--- a/release.md
+++ b/release.md
@@ -1,7 +1,8 @@
 ---
 layout: page
-title: Gluten Release
+title: Gluten Releases
 nav_order: 5
+permalink: /releases/
 ---
 
 [Gluten](https://github.com/apache/incubator-gluten) is a plugin for Apache 
Spark to double SparkSQL's performance.


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
