This is an automated email from the ASF dual-hosted git repository.
weitingchen pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/incubator-gluten-site.git
The following commit(s) were added to refs/heads/main by this push:
new bffb8fc refresh website design and document to v1.2.0 (#21)
bffb8fc is described below
commit bffb8fc97c6b8e8033def2a05dd360248f7a6131
Author: Wei-Ting Chen <[email protected]>
AuthorDate: Sun Aug 18 17:27:32 2024 +0800
refresh website design and document to v1.2.0 (#21)
---
404.html | 4 +-
_config.yml | 4 +-
archives/v1.1.1/developers/HowTo.md | 4 +-
archives/v1.1.1/developers/MicroBenchmarks.md | 8 +-
archives/v1.1.1/docs/GettingStarted_Velox.md | 26 +--
asf.md | 2 +-
assets/images/ClickHouse/CLion-Configuration-1.png | Bin 0 -> 162137 bytes
assets/images/ClickHouse/CLion-Configuration-2.png | Bin 0 -> 291104 bytes
assets/images/ClickHouse/CLion-Configuration-3.png | Bin 0 -> 11741 bytes
.../ClickHouse/ClickHouse-Backend-Architecture.png | Bin 0 -> 76190 bytes
.../Gluten-ClickHouse-Backend-Q6-DAG.png | Bin 0 -> 76396 bytes
assets/images/ClickHouse/cpp-ch-configuration.png | Bin 0 -> 254071 bytes
assets/images/Gazelle-jni.png | Bin 0 -> 34960 bytes
assets/images/TPC-H_Q6_DAG.png | Bin 0 -> 71841 bytes
assets/images/TPCH-q5-first-stage.png | Bin 0 -> 75037 bytes
.../v1.1.1 => assets}/images/apache-incubator.svg | 0
...decision_support_bench1_10query_performance.png | Bin 0 -> 59581 bytes
...cision_support_bench1_22queries_performance.png | Bin 0 -> 63448 bytes
assets/images/flow.png | Bin 0 -> 22177 bytes
.../v1.1.1 => assets}/images/gluten-logo-blue.png | Bin
assets/images/gluten-logo.svg | 9 +
assets/images/gluten-ui.png | Bin 0 -> 222262 bytes
assets/images/gluten.png | Bin 0 -> 136886 bytes
assets/images/gluten_golden_file_upload.png | Bin 0 -> 175664 bytes
assets/images/operators.png | Bin 0 -> 21905 bytes
assets/images/overall_design.png | Bin 0 -> 20447 bytes
assets/images/reproduce_natively.png | Bin 0 -> 36854 bytes
assets/images/support.png | Bin 0 -> 39160 bytes
...decision_support_bench1_10query_performance.png | Bin 0 -> 59353 bytes
...cision_support_bench1_22queries_performance.png | Bin 0 -> 65142 bytes
assets/images/veloxbe_memory_layout.png | Bin 0 -> 81914 bytes
contact-us.md | 2 +-
contributing.md | 4 +-
docs/developers/HowTo.md | 18 +-
docs/developers/HowToRelease.md | 14 +-
docs/developers/MicroBenchmarks.md | 22 +-
docs/developers/NewToGluten.md | 12 +-
docs/developers/SubstraitModifications.md | 24 +--
docs/developers/docker_centos7.md | 4 +-
docs/developers/docker_centos8.md | 4 +-
docs/developers/docker_ubuntu22.04.md | 4 +-
docs/get-started/ClickHouse.md | 48 ++---
docs/get-started/Velox.md | 30 +--
docs/get-started/VeloxABFS.md | 2 +-
docs/get-started/VeloxGCS.md | 8 +-
docs/get-started/VeloxLocalCache.md | 2 +-
docs/velox-backend/velox-backend-limitations.md | 4 +-
.../velox-backend-support-progress.md | 2 +-
.../velox-backend/velox-backend-troubleshooting.md | 2 +-
docs/velox-backend/velox-backend-udf.md | 239 +++++++++++++++++++++
images/apache-incubator.svg | 1 -
images/gluten-logo-blue.png | Bin 14469 -> 0 bytes
index.md | 7 +-
references.md | 32 +++
54 files changed, 411 insertions(+), 131 deletions(-)
diff --git a/404.html b/404.html
index 086a5c9..d00750a 100644
--- a/404.html
+++ b/404.html
@@ -20,6 +20,6 @@ layout: default
<div class="container">
<h1>404</h1>
- <p><strong>Page not found :(</strong></p>
- <p>The requested page could not be found.</p>
+ <p><strong>Gluten page not found :(</strong></p>
+ <p>The requested Gluten website page could not be found.</p>
</div>
diff --git a/_config.yml b/_config.yml
index 59054db..0ed7881 100644
--- a/_config.yml
+++ b/_config.yml
@@ -40,7 +40,7 @@ plugins:
- jekyll-readme-index # GitHub Pages
- jekyll-relative-links # GitHub Pages
-logo: "/images/gluten-logo-blue.png"
+logo: "/assets/images/gluten-logo-blue.png"
# Exclude from processing.
# The following items will not be processed, by default.
@@ -120,7 +120,7 @@ heading_anchors: true
# hide_icon: false # set to true to hide the external link icon - defaults
to false
# opens_in_new_tab: false # set to true to open this link in a new tab -
defaults to false
-color_scheme: dark
+color_scheme: light
# Footer content
# appears at the bottom of every page's main content
diff --git a/archives/v1.1.1/developers/HowTo.md
b/archives/v1.1.1/developers/HowTo.md
index 8911082..fbb90c8 100644
--- a/archives/v1.1.1/developers/HowTo.md
+++ b/archives/v1.1.1/developers/HowTo.md
@@ -117,7 +117,7 @@ gdb gluten_home/cpp/build/releases/libgluten.so
'core-Executor task l-2000883-16
Now, both Parquet and DWRF format files are supported, related scripts and
files are under the directory of `gluten_home/backends-velox/workload/tpch`.
The file `README.md` under `gluten_home/backends-velox/workload/tpch` offers
some useful help but it's still not enough and exact.
-One way of run TPC-H test is to run velox-be by workflow, you can refer to
[velox_be.yml](https://github.com/oap-project/gluten/blob/main/.github/workflows/velox_be.yml#L90)
+One way of run TPC-H test is to run velox-be by workflow, you can refer to
[velox_be.yml](https://github.com/apache/incubator-gluten/blob/branch-1.1.1/.github/workflows/velox_be.yml#L90)
Here will explain how to run TPC-H on Velox backend with the Parquet file
format.
1. First step, prepare the datasets, you have two choices.
@@ -136,7 +136,7 @@ Here will explain how to run TPC-H on Velox backend with
the Parquet file format
var gluten_root = "/home/gluten"
```
- Modify `gluten_home/backends-velox/workload/tpch/run_tpch/tpch_parquet.sh`.
- - Set `GLUTEN_JAR` correctly. Please refer to the section of [Build Gluten
with Velox Backend](../get-started/Velox.md/#2-build-gluten-with-velox-backend)
+ - Set `GLUTEN_JAR` correctly. Please refer to the section of [Build Gluten
with Velox
Backend](https://gluten.apache.org/archives/v1.1.1/docs/velox/getting-started/#build-gluten-with-velox-backend)
- Set `SPARK_HOME` correctly.
- Set the memory configurations appropriately.
- Execute `tpch_parquet.sh` using the below command.
diff --git a/archives/v1.1.1/developers/MicroBenchmarks.md
b/archives/v1.1.1/developers/MicroBenchmarks.md
index da24278..86dea04 100644
--- a/archives/v1.1.1/developers/MicroBenchmarks.md
+++ b/archives/v1.1.1/developers/MicroBenchmarks.md
@@ -258,15 +258,15 @@ done
### Run Examples
-We also provide some example inputs in
[cpp/velox/benchmarks/data](../../cpp/velox/benchmarks/data).
-E.g.
[generic_q5/q5_first_stage_0.json](../../cpp/velox/benchmarks/data/generic_q5/q5_first_stage_0.json)
simulates a
+We also provide some example inputs in
[cpp/velox/benchmarks/data](https://github.com/apache/incubator-gluten/tree/branch-1.1.1/cpp/velox/benchmarks/data).
+E.g.
[generic_q5/q5_first_stage_0.json](https://github.com/apache/incubator-gluten/blob/branch-1.1.1/cpp/velox/benchmarks/data/generic_q5/q5_first_stage_0.json)
simulates a
first-stage in TPCH Q5, which has the the most heaviest table scan. You can
follow below steps to run this example.
-1. Open
[generic_q5/q5_first_stage_0.json](../../cpp/velox/benchmarks/data/generic_q5/q5_first_stage_0_split.json)
with
+1. Open
[generic_q5/q5_first_stage_0.json](https://github.com/apache/incubator-gluten/blob/branch-1.1.1/cpp/velox/benchmarks/data/generic_q5/q5_first_stage_0_split.json)
with
file editor. Search for `"uriFile": "LINEITEM"` and replace `LINEITEM` with
the URI to one partition file in
lineitem. In the next line, replace the number in `"length": "..."` with
the actual file length. Suppose you are
using the provided small TPCH table
- in
[cpp/velox/benchmarks/data/tpch_sf10m](../../cpp/velox/benchmarks/data/tpch_sf10m),
the replaced JSON should be
+ in
[cpp/velox/benchmarks/data/tpch_sf10m](https://github.com/apache/incubator-gluten/tree/branch-1.1.1/cpp/velox/benchmarks/data/tpch_sf10m),
the replaced JSON should be
like:
```
diff --git a/archives/v1.1.1/docs/GettingStarted_Velox.md
b/archives/v1.1.1/docs/GettingStarted_Velox.md
index 017ecef..7f66d14 100644
--- a/archives/v1.1.1/docs/GettingStarted_Velox.md
+++ b/archives/v1.1.1/docs/GettingStarted_Velox.md
@@ -379,7 +379,7 @@ The following steps demonstrate how to set up a UDF library
project:
- The interface functions are mapping to marcos in
[Udf.h](../../cpp/velox/udf/Udf.h). Here's an example of how to implement these
functions:
- ```
+ ```cpp
// Filename MyUDF.cpp
#include <velox/expression/VectorFunction.h>
@@ -415,7 +415,7 @@ The following steps demonstrate how to set up a UDF library
project:
## Building the UDF library
To build the UDF library, users need to compile the C++ code and link to
`libvelox.so`. It's recommended to create a CMakeLists.txt for the project.
Here's an example:
-```
+```cpp
project(myudf)
set(CMAKE_CXX_STANDARD 17)
@@ -465,16 +465,16 @@ You can also specify the local or HDFS URIs to the UDF
libraries or archives. Lo
We provided an Velox UDF example file
[MyUDF.cpp](../../cpp/velox/udf/examples/MyUDF.cpp). After building gluten cpp,
you can find the example library at
/path/to/gluten/cpp/build/velox/udf/examples/libmyudf.so
Start spark-shell or spark-sql with below configuration
-```
+```shell
--files /path/to/gluten/cpp/build/velox/udf/examples/libmyudf.so
--conf spark.gluten.sql.columnar.backend.velox.udfLibraryPaths=libmyudf.so
```
Run query. The functions `myudf1` and `myudf2` increment the input value by a
constant of 5
-```
+```sql
select myudf1(1), myudf2(100L)
```
The output from spark-shell will be like
-```
+```sql
+----------------+------------------+
|udfexpression(1)|udfexpression(100)|
+----------------+------------------+
@@ -629,14 +629,14 @@ There is 8 QAT acceleration device(s) in the system:
3. Extra Gluten configurations are required when starting Spark application
-```
+```shell
--conf spark.gluten.sql.columnar.shuffle.codec=gzip # Valid options are gzip
and zstd
--conf spark.gluten.sql.columnar.shuffle.codecBackend=qat
```
4. You can use below command to check whether QAT is working normally at
run-time. The value of fw_counters should continue to increase during shuffle.
-```
+```shell
while :; do cat /sys/kernel/debug/qat_4xxx_0000:6b:00.0/fw_counters; sleep 1;
done
```
@@ -697,7 +697,7 @@ sudo ls -l /dev/iax
```
The output should be like:
-```
+```bash
total 0
crw-rw---- 1 root iaa 509, 0 Apr 5 18:54 wq1.0
crw-rw---- 1 root iaa 509, 5 Apr 5 18:54 wq11.0
@@ -711,7 +711,7 @@ crw-rw---- 1 root iaa 509, 4 Apr 5 18:54 wq9.0
2. Extra Gluten configurations are required when starting Spark application
-```
+```bash
--conf spark.gluten.sql.columnar.shuffle.codec=gzip
--conf spark.gluten.sql.columnar.shuffle.codecBackend=iaa
```
@@ -746,7 +746,7 @@ Some other versions of TPC-DS queries are also provided,
but are **not** recomme
Submit test script from spark-shell. You can find the scala code to [Run
TPC-H](../../tools/workload/tpch/run_tpch/tpch_parquet.scala) as an example.
Please remember to modify
the location of TPC-H files as well as TPC-H queries before you run the
testing.
-```
+```scala
var parquet_file_path = "/PATH/TO/TPCH_PARQUET_PATH"
var gluten_root = "/PATH/TO/GLUTEN"
```
@@ -777,7 +777,7 @@ Refer to [Gluten configuration](../Configuration.md) for
more details.
## Result
*wholestagetransformer* indicates that the offload works.
-
+
## Performance
@@ -811,7 +811,7 @@ Developers can register `SparkListener` to handle these two
Gluten events.
Gluten provides a tab based on Spark UI, named `Gluten SQL / DataFrame`
-
+
This tab contains two parts:
@@ -906,7 +906,7 @@ ashProbe: Input: 9 rows (864B, 3 batches), Output: 27 rows
(3.56KB, 3 batches),
Gluten provides a helper class to get the fallback summary from a Spark
Dataset.
-```
+```scala
import org.apache.spark.sql.execution.GlutenImplicits._
val df = spark.sql("SELECT * FROM t")
df.fallbackSummary
diff --git a/asf.md b/asf.md
index 903ee09..00e25af 100644
--- a/asf.md
+++ b/asf.md
@@ -1,7 +1,7 @@
---
layout: page
title: Apache Software Foundation
-nav_order: 7
+nav_order: 8
permalink: /asf/
---
diff --git a/assets/images/ClickHouse/CLion-Configuration-1.png
b/assets/images/ClickHouse/CLion-Configuration-1.png
new file mode 100644
index 0000000..14c739d
Binary files /dev/null and b/assets/images/ClickHouse/CLion-Configuration-1.png
differ
diff --git a/assets/images/ClickHouse/CLion-Configuration-2.png
b/assets/images/ClickHouse/CLion-Configuration-2.png
new file mode 100644
index 0000000..6fde6da
Binary files /dev/null and b/assets/images/ClickHouse/CLion-Configuration-2.png
differ
diff --git a/assets/images/ClickHouse/CLion-Configuration-3.png
b/assets/images/ClickHouse/CLion-Configuration-3.png
new file mode 100644
index 0000000..f2b2028
Binary files /dev/null and b/assets/images/ClickHouse/CLion-Configuration-3.png
differ
diff --git a/assets/images/ClickHouse/ClickHouse-Backend-Architecture.png
b/assets/images/ClickHouse/ClickHouse-Backend-Architecture.png
new file mode 100644
index 0000000..d2becdc
Binary files /dev/null and
b/assets/images/ClickHouse/ClickHouse-Backend-Architecture.png differ
diff --git a/assets/images/ClickHouse/Gluten-ClickHouse-Backend-Q6-DAG.png
b/assets/images/ClickHouse/Gluten-ClickHouse-Backend-Q6-DAG.png
new file mode 100644
index 0000000..b826c90
Binary files /dev/null and
b/assets/images/ClickHouse/Gluten-ClickHouse-Backend-Q6-DAG.png differ
diff --git a/assets/images/ClickHouse/cpp-ch-configuration.png
b/assets/images/ClickHouse/cpp-ch-configuration.png
new file mode 100644
index 0000000..0be4685
Binary files /dev/null and b/assets/images/ClickHouse/cpp-ch-configuration.png
differ
diff --git a/assets/images/Gazelle-jni.png b/assets/images/Gazelle-jni.png
new file mode 100644
index 0000000..1417e4e
Binary files /dev/null and b/assets/images/Gazelle-jni.png differ
diff --git a/assets/images/TPC-H_Q6_DAG.png b/assets/images/TPC-H_Q6_DAG.png
new file mode 100644
index 0000000..c4f32dc
Binary files /dev/null and b/assets/images/TPC-H_Q6_DAG.png differ
diff --git a/assets/images/TPCH-q5-first-stage.png
b/assets/images/TPCH-q5-first-stage.png
new file mode 100644
index 0000000..3955a59
Binary files /dev/null and b/assets/images/TPCH-q5-first-stage.png differ
diff --git a/archives/v1.1.1/images/apache-incubator.svg
b/assets/images/apache-incubator.svg
similarity index 100%
rename from archives/v1.1.1/images/apache-incubator.svg
rename to assets/images/apache-incubator.svg
diff --git
a/assets/images/clickhouse_decision_support_bench1_10query_performance.png
b/assets/images/clickhouse_decision_support_bench1_10query_performance.png
new file mode 100644
index 0000000..745f2f6
Binary files /dev/null and
b/assets/images/clickhouse_decision_support_bench1_10query_performance.png
differ
diff --git
a/assets/images/clickhouse_decision_support_bench1_22queries_performance.png
b/assets/images/clickhouse_decision_support_bench1_22queries_performance.png
new file mode 100644
index 0000000..b397ada
Binary files /dev/null and
b/assets/images/clickhouse_decision_support_bench1_22queries_performance.png
differ
diff --git a/assets/images/flow.png b/assets/images/flow.png
new file mode 100644
index 0000000..e47f724
Binary files /dev/null and b/assets/images/flow.png differ
diff --git a/archives/v1.1.1/images/gluten-logo-blue.png
b/assets/images/gluten-logo-blue.png
similarity index 100%
rename from archives/v1.1.1/images/gluten-logo-blue.png
rename to assets/images/gluten-logo-blue.png
diff --git a/assets/images/gluten-logo.svg b/assets/images/gluten-logo.svg
new file mode 100644
index 0000000..aeee80b
--- /dev/null
+++ b/assets/images/gluten-logo.svg
@@ -0,0 +1,9 @@
+<svg width="336" height="116" viewBox="0 0 336 116" fill="none"
xmlns="http://www.w3.org/2000/svg">
+<path d="M152.039 78.3122V70.5858H169.407V89.0819C167.776 91.5467 165.426
93.4478 162.676 94.5257C159.452 95.8963 155.979 96.5823 152.477 96.5403C148.358
96.6155 144.294 95.5869 140.706 93.5613C137.304 91.6011 134.528 88.7135 132.701
85.2348C130.754 81.4587 129.775 77.2568 129.854 73.0077C129.765 68.7221 130.779
64.4856 132.797 60.7055C134.663 57.246 137.467 54.3854 140.888 52.4541C144.367
50.4919 148.302 49.4828 152.295 49.5286C156.114 49.4377 159.881 50.4216 163.168
52.3684C166.02 53.9 [...]
+<path d="M184.803 96.9631H176.242V49.7156H184.803V96.9631Z" fill="#101010"/>
+<path d="M214.675 63.5989H223.29V95.7473H215.659C215.499 93.2612 215.328
91.5894 215.167 90.7214C214.209 92.5479 212.745 94.0586 210.951 95.0722C209.119
96.0421 207.072 96.5324 205.001 96.4974C203.416 96.5786 201.832 96.3085 200.364
95.7065C198.895 95.1045 197.577 94.1854 196.504 93.0147C194.442 90.7 193.407
87.3066 193.4 82.8343V63.5453H202.026V79.9303C202.026 85.8527 203.927 88.814
207.73 88.814C208.698 88.8384 209.659 88.6315 210.531 88.2104C211.404 87.7892
212.164 87.1659 212.749 86. [...]
+<path d="M230.449
63.5989V53.7186H239.139V63.5989H248.77V70.8645H239.139V82.3736C239.013 84.0789
239.385 85.7841 240.209 87.2816C240.609 87.8525 241.152 88.3083 241.782
88.6035C242.413 88.8988 243.111 89.023 243.805 88.964C245.269 88.9557 246.695
88.4995 247.893 87.6567L250.868 93.8613C248.727 95.6045 245.792 96.4761 242.06
96.4761C239.016 96.6193 236.025 95.6351 233.66 93.7113C231.519 91.8681 230.449
88.5675 230.449 83.8096V63.5989Z" fill="#101010"/>
+<path d="M276.755 64.6812C278.95 65.9342 280.737 67.7946 281.902
70.0393C283.157 72.487 283.785 75.2079 283.732 77.9585C283.728 79.3137 283.602
80.6657 283.358 81.9985H262.019C262.225 84.0605 263.149 85.984 264.63
87.4316C266.03 88.6672 267.848 89.3225 269.713 89.2641C271.067 89.2874 272.402
88.9549 273.587 88.2996C274.742 87.6354 275.662 86.628 276.22 85.417L282.544
89.1355C281.648 91.5332 279.862 93.4913 277.557 94.6007C275.016 95.9057 272.194
96.568 269.339 96.5296C266.536 96.5504 263 [...]
+<path d="M316.907 66.2994C318.969 68.6355 320 72.0397 320
76.5119V95.801H311.385V79.3088C311.385 73.4292 309.441 70.4894 305.553
70.4894C304.592 70.462 303.638 70.6645 302.771 71.0801C301.903 71.4957 301.147
72.1124 300.566 72.8791C299.19 74.8461 298.522 77.2227 298.672
79.6196V95.7581H290.057V63.6096H297.623C297.826 66.1279 298.008 67.8961 298.179
68.7641C300.241 64.7848 303.673 62.7952 308.474 62.7952C310.051 62.7204 311.625
62.9958 313.083 63.6016C314.541 64.2074 315.847 65.1287 316.9 [...]
+<path d="M42.649 92.2652H52.6033V70.0424C52.6091 65.9408 53.9646 61.9554
56.4599 58.7022C56.947 58.0619 57.4774 57.4559 58.0475 56.8884C59.7708 55.1577
61.8189 53.7853 64.074 52.85C66.3292 51.9147 68.7468 51.435 71.1878
51.4385H93.189C95.6286 51.4361 98.0446 51.9164 100.298 52.8516C102.552 53.7869
104.598 55.1588 106.321 56.8884C106.891 57.4559 107.421 58.0619 107.908
58.7022C110.416 61.9488 111.773 65.9386 111.765 70.0424V89.7173C111.785 92.5738
110.675 95.322 108.676 97.361L108.633 97. [...]
+</svg>
diff --git a/assets/images/gluten-ui.png b/assets/images/gluten-ui.png
new file mode 100644
index 0000000..fd2e083
Binary files /dev/null and b/assets/images/gluten-ui.png differ
diff --git a/assets/images/gluten.png b/assets/images/gluten.png
new file mode 100644
index 0000000..4f3f2db
Binary files /dev/null and b/assets/images/gluten.png differ
diff --git a/assets/images/gluten_golden_file_upload.png
b/assets/images/gluten_golden_file_upload.png
new file mode 100644
index 0000000..c142fbe
Binary files /dev/null and b/assets/images/gluten_golden_file_upload.png differ
diff --git a/assets/images/operators.png b/assets/images/operators.png
new file mode 100644
index 0000000..c72d84e
Binary files /dev/null and b/assets/images/operators.png differ
diff --git a/assets/images/overall_design.png b/assets/images/overall_design.png
new file mode 100644
index 0000000..9cc9e02
Binary files /dev/null and b/assets/images/overall_design.png differ
diff --git a/assets/images/reproduce_natively.png
b/assets/images/reproduce_natively.png
new file mode 100644
index 0000000..5f58e83
Binary files /dev/null and b/assets/images/reproduce_natively.png differ
diff --git a/assets/images/support.png b/assets/images/support.png
new file mode 100644
index 0000000..d67e4d1
Binary files /dev/null and b/assets/images/support.png differ
diff --git
a/assets/images/velox_decision_support_bench1_10query_performance.png
b/assets/images/velox_decision_support_bench1_10query_performance.png
new file mode 100644
index 0000000..b6cc7b4
Binary files /dev/null and
b/assets/images/velox_decision_support_bench1_10query_performance.png differ
diff --git
a/assets/images/velox_decision_support_bench1_22queries_performance.png
b/assets/images/velox_decision_support_bench1_22queries_performance.png
new file mode 100644
index 0000000..13c9446
Binary files /dev/null and
b/assets/images/velox_decision_support_bench1_22queries_performance.png differ
diff --git a/assets/images/veloxbe_memory_layout.png
b/assets/images/veloxbe_memory_layout.png
new file mode 100644
index 0000000..47c1bad
Binary files /dev/null and b/assets/images/veloxbe_memory_layout.png differ
diff --git a/contact-us.md b/contact-us.md
index 19160ed..63f9bcb 100644
--- a/contact-us.md
+++ b/contact-us.md
@@ -1,7 +1,7 @@
---
layout: page
title: Contact Us
-nav_order: 7
+nav_order: 9
---
# Contact Us
diff --git a/contributing.md b/contributing.md
index 1ade03c..24feb44 100644
--- a/contributing.md
+++ b/contributing.md
@@ -1,10 +1,10 @@
---
layout: page
title: Contributing to Gluten
-nav_order: 6
+nav_order: 7
---
-## How to become a committer
+# How to become a committer
To initiate your contributions to Gluten, understand the contribution
process—any individual can submit patches, documentation, and examples to the
project.
diff --git a/docs/developers/HowTo.md b/docs/developers/HowTo.md
index 9c7401a..d9e3dc5 100644
--- a/docs/developers/HowTo.md
+++ b/docs/developers/HowTo.md
@@ -44,14 +44,14 @@ To debug C++, you have to generate the example files, the
example files consist
You can generate the example files by the following steps:
1. build Velox and Gluten CPP
-```
+```bash
gluten_home/dev/builddeps-veloxbe.sh --build_tests=ON --build_benchmarks=ON
--build_type=Debug
```
- Compiling with `--build_type=Debug` is good for debugging.
- The executable file `generic_benchmark` will be generated under the
directory of `gluten_home/cpp/build/velox/benchmarks/`.
2. build Gluten and generate the example files
-```
+```bash
cd gluten_home
mvn clean package -Pspark-3.2 -Pbackends-velox -Prss
mvn test -Pspark-3.2 -Pbackends-velox -Prss -pl backends-velox -am
-DtagsToInclude="io.glutenproject.tags.GenerateExample" -Dtest=none
-DfailIfNoTests=false -Darrow.version=11.0.0-gluten -Dexec.skip
@@ -72,7 +72,7 @@ gluten_home/backends-velox/generated-native-benchmark/
```
3. now, run benchmarks with GDB
-```
+```bash
cd gluten_home/cpp/build/velox/benchmarks/
gdb generic_benchmark
```
@@ -91,7 +91,7 @@ gdb generic_benchmark
will be used as default.
- You can also edit the file `example.json` to custom the Substrait plan or
specify the inputs files placed in the other directory.
-6. get more detail information about benchmarks from
[MicroBenchmarks](./MicroBenchmarks.md)
+6. get more detail information about benchmarks from
[MicroBenchmarks](https://gluten.apache.org/docs/developers/microbenchmarks/#generate-micro-benchmarks-for-velox-backend)
## 2 How to debug plan validation process
Gluten will validate generated plan before execute it, and validation usually
happens in native side, so we provide a utility to help debug validation
process in native side.
@@ -105,7 +105,7 @@ wait to add
## 4 How to debug with core-dump
wait to complete
-```
+```bash
cd the_directory_of_core_file_generated
gdb gluten_home/cpp/build/releases/libgluten.so 'core-Executor task
l-2000883-1671542526'
@@ -117,7 +117,7 @@ gdb gluten_home/cpp/build/releases/libgluten.so
'core-Executor task l-2000883-16
Now, both Parquet and DWRF format files are supported, related scripts and
files are under the directory of `gluten_home/backends-velox/workload/tpch`.
The file `README.md` under `gluten_home/backends-velox/workload/tpch` offers
some useful help but it's still not enough and exact.
-One way of run TPC-H test is to run velox-be by workflow, you can refer to
[velox_be.yml](https://github.com/oap-project/gluten/blob/main/.github/workflows/velox_be.yml#L90)
+One way to run the TPC-H test is to run Velox with Docker via the workflow; you can
refer to
[velox_docker.yml](https://github.com/apache/incubator-gluten/blob/main/.github/workflows/velox_docker.yml)
Here will explain how to run TPC-H on Velox backend with the Parquet file
format.
1. First step, prepare the datasets, you have two choices.
@@ -128,15 +128,15 @@ Here will explain how to run TPC-H on Velox backend with
the Parquet file format
2. Second step, run TPC-H on Velox backend testing.
- Modify
`gluten_home/backends-velox/workload/tpch/run_tpch/tpch_parquet.scala`.
- set `var parquet_file_path` to correct directory. If using the small
dataset directly in the step one, then modify it as below
- ```
+ ```scala
var parquet_file_path =
"gluten_home/backends-velox/src/test/resources/tpch-data-parquet-velox"
```
- set `var gluten_root` to correct directory. If `gluten_home` is the
directory of `/home/gluten`, then modify it as below
- ```
+ ```scala
var gluten_root = "/home/gluten"
```
- Modify `gluten_home/backends-velox/workload/tpch/run_tpch/tpch_parquet.sh`.
- - Set `GLUTEN_JAR` correctly. Please refer to the section of [Build Gluten
with Velox Backend](../get-started/Velox.md/#2-build-gluten-with-velox-backend)
+ - Set `GLUTEN_JAR` correctly. Please refer to the section of [Build Gluten
with Velox
Backend](http://gluten.apache.org/docs/getting-started/velox-backend/#build-gluten-with-velox-backend)
- Set `SPARK_HOME` correctly.
- Set the memory configurations appropriately.
- Execute `tpch_parquet.sh` using the below command.
diff --git a/docs/developers/HowToRelease.md b/docs/developers/HowToRelease.md
index 8b50e34..d945e2f 100644
--- a/docs/developers/HowToRelease.md
+++ b/docs/developers/HowToRelease.md
@@ -3,6 +3,8 @@ layout: page
title: How To Release
nav_order: 10
parent: Developers
+grand_parent: Documentations
+permalink: /docs/developers/how-to-release/
---
# How to Release
@@ -42,7 +44,7 @@ All projects under the Apache umbrella must adhere to the
[Apache Release Policy
3. Sign the release artifacts with the GPG key.
-```
+```bash
# create a GPG key; after executing this command, select the first option: RSA and RSA
$ gpg --full-generate-key
@@ -66,7 +68,7 @@ $ for i in *.tar.gz; do echo $i; gpg --local-user xxxx
--armor --output $i.asc -
#### How to Generate checksums for the release artifacts.
-```
+```bash
# create the checksums
$ for i in *.tar.gz; do echo $i; sha512sum $i > $i.sha512 ; done
```
@@ -82,7 +84,7 @@ $ for i in *.tar.gz; do echo $i; sha512sum $i > $i.sha512 ;
done
release-version format: apache-gluten-#.#.#-rc#
3. Upload the release artifacts to the SVN repository.
-```
+```bash
$ svn co https://dist.apache.org/repos/dist/dev/incubator/gluten/
$ cp /path/to/release/artifacts/* ./{release-version}/
$ svn add ./{release-version}/*
@@ -91,7 +93,7 @@ $ svn commit -m "add Apache Answer release artifacts for
{release-version}"
4. After the upload, please visit the link
`https://dist.apache.org/repos/dist/dev/incubator/gluten/{release-version}` to
verify if the file upload is successful or not.
The upload release artifacts should be include
-```
+```bash
* apache-gluten-#.#.#-incubating-src.tar.gz
* apache-gluten-#.#.#-incubating-src.tar.gz.asc
* apache-gluten-#.#.#-incubating-src.tar.gz.sha512
@@ -119,7 +121,7 @@ Please follow below steps to verify the release artifacts.
Please follow below steps to verify the signatures.
-```
+```bash
# download KEYS
$ curl https://dist.apache.org/repos/dist/release/incubator/gluten/KEYS > KEYS
@@ -144,7 +146,7 @@ $ for i in *.tar.gz; do echo $i; gpg --verify $i.asc $i ;
done
#### How to Verify the checksums
Please follow below steps to verify the checksums
-```
+```bash
# verify the checksums
$ for i in *.tar.gz; do echo $i; sha512sum --check $i.sha512; done
```
diff --git a/docs/developers/MicroBenchmarks.md
b/docs/developers/MicroBenchmarks.md
index af99705..eb02c3d 100644
--- a/docs/developers/MicroBenchmarks.md
+++ b/docs/developers/MicroBenchmarks.md
@@ -217,12 +217,12 @@ following argument flags to the command:
- --iaa-gzip: IAA GZIP codec, compression level 1
Note using QAT or IAA codec requires Gluten cpp is built with these features.
-Please check the corresponding section in [Velox
document](../get-started/Velox.md) first for how to setup, build and
+Please check the corresponding section in [Velox
document](http://gluten.apache.org/docs/getting-started/velox-backend/) first
for how to setup, build and
enable these features in Gluten.
For QAT support, please
-check [Intel® QuickAssist Technology (QAT)
support](../get-started/Velox.md#intel-quickassist-technology-qat-support).
+check [Intel® QuickAssist Technology (QAT)
support](http://gluten.apache.org/docs/getting-started/velox-backend/#intel-quickassist-technology-qat-support).
For IAA support, please
-check [Intel® In-memory Analytics Accelerator (IAA/IAX)
support](../get-started/Velox.md#intel-in-memory-analytics-accelerator-iaaiax-support)
+check [Intel® In-memory Analytics Accelerator (IAA/IAX)
support](http://gluten.apache.org/docs/getting-started/velox-backend/#intel-in-memory-analytics-accelerator-iaaiax-support)
## Simulate Spark with multiple processes and threads
@@ -258,18 +258,18 @@ done
### Run Examples
-We also provide some example inputs in
[cpp/velox/benchmarks/data](../../cpp/velox/benchmarks/data).
-E.g.
[generic_q5/q5_first_stage_0.json](../../cpp/velox/benchmarks/data/generic_q5/q5_first_stage_0.json)
simulates a
+We also provide some example inputs in
[cpp/velox/benchmarks/data](https://github.com/apache/incubator-gluten/tree/main/cpp/velox/benchmarks/data).
+E.g.
[generic_q5/q5_first_stage_0.json](https://github.com/apache/incubator-gluten/tree/main/cpp/velox/benchmarks/data/generic_q5/q5_first_stage_0.json)
simulates a
first-stage in TPCH Q5, which has the the most heaviest table scan. You can
follow below steps to run this example.
-1. Open
[generic_q5/q5_first_stage_0.json](../../cpp/velox/benchmarks/data/generic_q5/q5_first_stage_0_split.json)
with
+1. Open
[generic_q5/q5_first_stage_0.json](https://github.com/apache/incubator-gluten/tree/main/cpp/velox/benchmarks/data/generic_q5/q5_first_stage_0_split.json)
with
file editor. Search for `"uriFile": "LINEITEM"` and replace `LINEITEM` with
the URI to one partition file in
lineitem. In the next line, replace the number in `"length": "..."` with
the actual file length. Suppose you are
using the provided small TPCH table
- in
[cpp/velox/benchmarks/data/tpch_sf10m](../../cpp/velox/benchmarks/data/tpch_sf10m),
the replaced JSON should be
+ in
[cpp/velox/benchmarks/data/tpch_sf10m](https://github.com/apache/incubator-gluten/tree/main/cpp/velox/benchmarks/data/tpch_sf10m),
the replaced JSON should be
like:
-```
+```json
{
"items": [
{
@@ -283,7 +283,7 @@ first-stage in TPCH Q5, which has the the most heaviest
table scan. You can foll
2. Launch multiple processes and multiple threads. Set
`GLUTEN_SPARK_LOCAL_DIRS` and add --with-shuffle to the command.
-```
+```bash
mkdir -p {/data1,/data2,/data3}/tmp # Make sure each directory has been
already created.
export GLUTEN_SPARK_LOCAL_DIRS=/data1/tmp,/data2/tmp,/data3/tmp
@@ -298,11 +298,11 @@ done >stdout.log 2>stderr.log
You can find the "elapsed_time" and other metrics in stdout.log. In below
output, the "elapsed_time" is ~10.75s. If you
run TPCH Q5 with Gluten on Spark, a single task in the same Spark stage should
take about the same time.
-```
+```bash
------------------------------------------------------------------------------------------------------------------
Benchmark Time
CPU Iterations UserCounters...
------------------------------------------------------------------------------------------------------------------
SkipInput/iterations:1/process_time/real_time/threads:8 1317255379 ns
10061941861 ns 8 collect_batch_time=0 elapsed_time=10.7563G
shuffle_compress_time=4.19964G shuffle_spill_time=0 shuffle_split_time=0
shuffle_write_time=1.91651G
```
-
+
diff --git a/docs/developers/NewToGluten.md b/docs/developers/NewToGluten.md
index bdd83b6..78bdc89 100644
--- a/docs/developers/NewToGluten.md
+++ b/docs/developers/NewToGluten.md
@@ -55,7 +55,7 @@ And then set the environment setting.
# Compile gluten using debug mode
If you want to just debug java/scala code, there is no need to compile cpp
code with debug mode.
-You can just refer to
[build-gluten-with-velox-backend](../get-started/Velox.md#2-build-gluten-with-velox-backend).
+You can just refer to
[build-gluten-with-velox-backend](https://localhost:4000/docs/getting-started/velox-backend/#build-gluten-with-velox-backend).
If you need to debug cpp code, please compile the backend code and gluten cpp
code with debug mode.
@@ -103,18 +103,18 @@ If you have Ultimate intellij, you can try to debug
remotely.
## Java/Scala code style
-Intellij IDE supports importing settings for Java/Scala code style. You can
import [intellij-codestyle.xml](../../dev/intellij-codestyle.xml) to your IDE.
+Intellij IDE supports importing settings for Java/Scala code style. You can
import
[intellij-codestyle.xml](https://github.com/apache/incubator-gluten/blob/main/dev/intellij-codestyle.xml)
to your IDE.
See [Intellij
guide](https://www.jetbrains.com/help/idea/configuring-code-style.html#import-code-style).
To generate a fix for Java/Scala code style, you can run one or more of the
below commands according to the code modules involved in your PR.
For Velox backend:
-```
+```bash
mvn spotless:apply -Pbackends-velox -Prss -Pspark-3.2 -Pspark-ut -DskipTests
mvn spotless:apply -Pbackends-velox -Prss -Pspark-3.3 -Pspark-ut -DskipTests
```
For Clickhouse backend:
-```
+```bash
mvn spotless:apply -Pbackends-clickhouse -Pspark-3.2 -Pspark-ut -DskipTests
mvn spotless:apply -Pbackends-clickhouse -Pspark-3.3 -Pspark-ut -DskipTests
```
@@ -362,12 +362,12 @@ wait to attach....
# Run TPC-H and TPC-DS
We supply `<gluten_home>/tools/gluten-it` to execute these queries
-Refer to
[velox_be.yml](https://github.com/oap-project/gluten/blob/main/.github/workflows/velox_be.yml)
+Refer to
[velox_docker.yml](https://github.com/apache/incubator-gluten/blob/main/.github/workflows/velox_docker.yml)
# Run gluten+velox on clean machine
We can run gluten + velox on clean machine by one command (supported OS:
Ubuntu20.04/22.04, Centos 7/8, etc.).
-```
+```bash
spark-shell --name run_gluten \
--master yarn --deploy-mode client \
--conf spark.plugins=io.glutenproject.GlutenPlugin \
diff --git a/docs/developers/SubstraitModifications.md
b/docs/developers/SubstraitModifications.md
index 75e0c55..8ba0844 100644
--- a/docs/developers/SubstraitModifications.md
+++ b/docs/developers/SubstraitModifications.md
@@ -19,19 +19,19 @@ alternatives like `AdvancedExtension` could be considered.
## Modifications to algebra.proto
-* Added `JsonReadOptions` and `TextReadOptions` in
`FileOrFiles`([#1584](https://github.com/oap-project/gluten/pull/1584)).
-* Changed join type `JOIN_TYPE_SEMI` to `JOIN_TYPE_LEFT_SEMI` and
`JOIN_TYPE_RIGHT_SEMI`([#408](https://github.com/oap-project/gluten/pull/408)).
+* Added `JsonReadOptions` and `TextReadOptions` in
`FileOrFiles`([#1584](https://github.com/apache/incubator-gluten/pull/1584)).
+* Changed join type `JOIN_TYPE_SEMI` to `JOIN_TYPE_LEFT_SEMI` and
`JOIN_TYPE_RIGHT_SEMI`([#408](https://github.com/apache/incubator-gluten/pull/408)).
* Added `WindowRel`, added `column_name` and `window_type` in `WindowFunction`,
-changed `Unbounded` in `WindowFunction` into `Unbounded_Preceding` and
`Unbounded_Following`, and added
WindowType([#485](https://github.com/oap-project/gluten/pull/485)).
-* Added `output_schema` in
RelRoot([#1901](https://github.com/oap-project/gluten/pull/1901)).
-* Added `ExpandRel`([#1361](https://github.com/oap-project/gluten/pull/1361)).
-* Added `GenerateRel`([#574](https://github.com/oap-project/gluten/pull/574)).
-* Added `PartitionColumn` in
`LocalFiles`([#2405](https://github.com/oap-project/gluten/pull/2405)).
-* Added `WriteRel` ([#3690](https://github.com/oap-project/gluten/pull/3690)).
+changed `Unbounded` in `WindowFunction` into `Unbounded_Preceding` and
`Unbounded_Following`, and added
WindowType([#485](https://github.com/apache/incubator-gluten/pull/485)).
+* Added `output_schema` in
RelRoot([#1901](https://github.com/apache/incubator-gluten/pull/1901)).
+* Added
`ExpandRel`([#1361](https://github.com/apache/incubator-gluten/pull/1361)).
+* Added
`GenerateRel`([#574](https://github.com/apache/incubator-gluten/pull/574)).
+* Added `PartitionColumn` in
`LocalFiles`([#2405](https://github.com/apache/incubator-gluten/pull/2405)).
+* Added `WriteRel`
([#3690](https://github.com/apache/incubator-gluten/pull/3690)).
## Modifications to type.proto
-* Added `Nothing` in
`Type`([#791](https://github.com/oap-project/gluten/pull/791)).
-* Added `names` in
`Struct`([#1878](https://github.com/oap-project/gluten/pull/1878)).
-* Added `PartitionColumns` in
`NamedStruct`([#320](https://github.com/oap-project/gluten/pull/320)).
-* Remove `PartitionColumns` and add `column_types` in
`NamedStruct`([#2405](https://github.com/oap-project/gluten/pull/2405)).
+* Added `Nothing` in
`Type`([#791](https://github.com/apache/incubator-gluten/pull/791)).
+* Added `names` in
`Struct`([#1878](https://github.com/apache/incubator-gluten/pull/1878)).
+* Added `PartitionColumns` in
`NamedStruct`([#320](https://github.com/apache/incubator-gluten/pull/320)).
+* Remove `PartitionColumns` and add `column_types` in
`NamedStruct`([#2405](https://github.com/apache/incubator-gluten/pull/2405)).
diff --git a/docs/developers/docker_centos7.md
b/docs/developers/docker_centos7.md
index 8c11e00..e169f47 100644
--- a/docs/developers/docker_centos7.md
+++ b/docs/developers/docker_centos7.md
@@ -9,14 +9,14 @@ permalink: /docs/developers/docker-centos7/
Here is a docker script we verified to build Gluten+Velox backend on CentOS 7:
Run on host as root user:
-```
+```bash
docker pull centos:7
docker run -itd --name gluten centos:7 /bin/bash
docker attach gluten
```
Run in docker:
-```
+```bash
yum -y install epel-release centos-release-scl
yum -y install \
git \
diff --git a/docs/developers/docker_centos8.md
b/docs/developers/docker_centos8.md
index 20a2ff2..3c733a9 100755
--- a/docs/developers/docker_centos8.md
+++ b/docs/developers/docker_centos8.md
@@ -9,14 +9,14 @@ permalink: /docs/developers/docker-centos8/
Here is a docker script we verified to build Gluten+Velox backend on Centos8:
Run on host as root user:
-```
+```bash
docker pull centos:8
docker run -itd --name gluten centos:8 /bin/bash
docker attach gluten
```
Run in docker:
-```
+```bash
#update mirror
sed -i -e "s|mirrorlist=|#mirrorlist=|g" /etc/yum.repos.d/CentOS-*
sed -i -e
"s|#baseurl=http://mirror.centos.org|baseurl=http://vault.centos.org|g"
/etc/yum.repos.d/CentOS-*
diff --git a/docs/developers/docker_ubuntu22.04.md
b/docs/developers/docker_ubuntu22.04.md
index 64364af..4ce4a00 100644
--- a/docs/developers/docker_ubuntu22.04.md
+++ b/docs/developers/docker_ubuntu22.04.md
@@ -11,14 +11,14 @@ To the first build, it's suggested to build Gluten in a
clean docker image. Othe
Here is a docker script we verified to build Gluten+Velox backend on
Ubuntu22.04/20.04:
Run on host as root user:
-```
+```bash
docker pull ubuntu:22.04
docker run -itd --network host --name gluten ubuntu:22.04 /bin/bash
docker attach gluten
```
Run in docker:
-```
+```bash
apt-get update
#install gcc and libraries to build arrow
diff --git a/docs/get-started/ClickHouse.md b/docs/get-started/ClickHouse.md
index 9d880b4..1adb206 100644
--- a/docs/get-started/ClickHouse.md
+++ b/docs/get-started/ClickHouse.md
@@ -15,7 +15,7 @@ We port ClickHouse ( based on version **23.1** ) as a
library, called 'libch.so'
The architecture of the ClickHouse backend is shown below:
-
+
1. On Spark driver, Spark uses Gluten SparkPlugin to transform the physical
plan to the Substrait plan, and then pass the Substrait plan to ClickHouse
backend through JNI call on executors.
2. Based on Spark DataSource V2 interface, implementing a ClickHouse Catalog
to support operating the ClickHouse tables, and then using Delta to save some
metadata about ClickHouse like the MergeTree parts information, and also
provide ACID transactions.
@@ -55,7 +55,7 @@ If you don't care about development environment, you can skip
this part.
Otherwise, do:
1. clone Kyligence/ClickHouse repo
- ```
+ ```bash
cd /to/some/place/
git clone --recursive --shallow-submodules -b clickhouse_backend
https://github.com/Kyligence/ClickHouse.git
```
@@ -78,26 +78,26 @@ Otherwise, do:
- Open ClickHouse repo
- Choose File -> Settings -> Build, Execution, Deployment -> Toolchains,
and then choose Bundled CMake, clang-16 as C Compiler, clang++-16 as C++
Compiler:
-

+

- Choose File -> Settings -> Build, Execution, Deployment -> CMake:
-

+

And then add these options into CMake options:
- ```
+ ```shell
-DENABLE_PROTOBUF=ON -DENABLE_TESTS=OFF -DENABLE_JEMALLOC=ON
-DENABLE_MULTITARGET_CODE=ON -DENABLE_EXTERN_LOCAL_ENGINE=ON
```
- Build 'ch' target on ClickHouse Project with Debug mode or Release mode:
-

+

If it builds with Release mode successfully, there is a library file
called 'libch.so' in path
'${CH_SOURCE_DIR}/cmake-build-release/utils/extern-local-engine/'.
If it builds with Debug mode successfully, there is a library file
called 'libchd.so' in path
'${CH_SOURCE_DIR}/cmake-build-debug/utils/extern-local-engine/'.
4. (Option 2) Use command line
- ```
+ ```shell
cmake --build ${GLUTEN_SOURCE}/cpp-ch/build_ch --target build_ch
```
If it builds successfully, there is a library file called 'libch.so' in
path '${GLUTEN_SOURCE}/cpp-ch/build/utils/extern-local-engine/'.
@@ -106,7 +106,7 @@ Otherwise, do:
In case you don't want a develop environment, you can use the following
command to compile ClickHouse backend directly:
-```
+```bash
git clone https://github.com/apache/incubator-gluten.git
cd incubator-gluten
bash ./ep/build-clickhouse/src/build_clickhouse.sh
@@ -123,7 +123,7 @@ The prerequisites are the same as the one mentioned above.
Compile Gluten with C
- for Spark 3.2.2<span id="deploy-spark-322"></span>
-```
+```bash
git clone https://github.com/apache/incubator-gluten.git
cd incubator-gluten/
export MAVEN_OPTS="-Xmx8g -XX:ReservedCodeCacheSize=2g"
@@ -133,7 +133,7 @@ The prerequisites are the same as the one mentioned above.
Compile Gluten with C
- for Spark 3.3.1
-```
+```bash
git clone https://github.com/apache/incubator-gluten.git
cd incubator-gluten/
export MAVEN_OPTS="-Xmx8g -XX:ReservedCodeCacheSize=2g"
@@ -147,7 +147,7 @@ The prerequisites are the same as the one mentioned above.
Compile Gluten with C
- for Spark 3.2.2
-```
+```bash
tar zxf spark-3.2.2-bin-hadoop2.7.tgz
cd spark-3.2.2-bin-hadoop2.7
rm -f jars/protobuf-java-2.5.0.jar
@@ -160,7 +160,7 @@ cp gluten-XXXXX-spark-3.2-jar-with-dependencies.jar jars/
- for Spark 3.3.1
-```
+```bash
tar zxf spark-3.3.1-bin-hadoop2.7.tgz
cd spark-3.3.1-bin-hadoop2.7
rm -f jars/protobuf-java-2.5.0.jar
@@ -174,7 +174,7 @@ cp gluten-XXXXX-spark-3.3-jar-with-dependencies.jar jars/
#### Query local data
##### Start Spark Thriftserver on local
-```
+```bash
cd spark-3.2.2-bin-hadoop2.7
./sbin/start-thriftserver.sh \
--master local[3] \
@@ -218,7 +218,7 @@ bin/beeline -u jdbc:hive2://localhost:10000/ -n root
Currently, the feature of writing ClickHouse MergeTree parts by Spark is
developing, so you need to use command 'clickhouse-local' to generate MergeTree
parts data manually. We provide a python script to call the command
'clickhouse-local' to convert parquet data to MergeTree parts:
-```
+```bash
#install ClickHouse community version
sudo apt-get install -y apt-transport-https ca-certificates dirmngr
@@ -236,7 +236,7 @@ python3
/path_to_clickhouse_backend_src/utils/local-engine/tool/parquet_to_merge
- Create a TPC-H lineitem table using ClickHouse DataSource
-```
+```sql
DROP TABLE IF EXISTS lineitem;
CREATE TABLE IF NOT EXISTS lineitem (
l_orderkey bigint,
@@ -263,7 +263,7 @@ python3
/path_to_clickhouse_backend_src/utils/local-engine/tool/parquet_to_merge
- TPC-H Q6 test
-```
+```sql
SELECT
sum(l_extendedprice * l_discount) AS revenue
FROM
@@ -279,7 +279,7 @@ python3
/path_to_clickhouse_backend_src/utils/local-engine/tool/parquet_to_merge
The DAG is shown on Spark UI as below:
-

+

##### Query local Parquet files
@@ -493,7 +493,7 @@ This benchmark is tested on AWS EC2 cluster, there are 7
EC2 instances:
- Deploy gluten-core-XXXXX-jar-with-dependencies.jar
-```
+```bash
#deploy 'gluten-core-XXXXX-jar-with-dependencies.jar' to every node, and
then
cp gluten-core-XXXXX-jar-with-dependencies.jar /path_to_spark/jars/
```
@@ -507,7 +507,7 @@ This benchmark is tested on AWS EC2 cluster, there are 7
EC2 instances:
- JuiceFS uses Redis to save metadata, install redis firstly:
-```
+```bash
wget https://download.redis.io/releases/redis-6.0.14.tar.gz
sudo apt install build-essential
tar -zxvf redis-6.0.14.tar.gz
@@ -525,7 +525,7 @@ This benchmark is tested on AWS EC2 cluster, there are 7
EC2 instances:
Please refer to
[The-JuiceFS-Command-Reference](https://juicefs.com/docs/community/command_reference)
-```
+```bash
wget
https://github.com/juicedata/juicefs/releases/download/v0.17.5/juicefs-0.17.5-linux-amd64.tar.gz
tar -zxvf juicefs-0.17.5-linux-amd64.tar.gz
@@ -544,7 +544,7 @@ Please refer to [Data-preparation](#data-preparation) to
generate MergeTree part
#### Run Spark Thriftserver
-```
+```bash
cd spark-3.2.2-bin-hadoop2.7
./sbin/start-thriftserver.sh \
--master spark://master-ip:7070 --deploy-mode client \
@@ -585,7 +585,7 @@ cd spark-3.2.2-bin-hadoop2.7
- Create a lineitem table using clickhouse datasource
-```
+```sql
DROP TABLE IF EXISTS lineitem;
CREATE TABLE IF NOT EXISTS lineitem (
l_orderkey bigint,
@@ -639,7 +639,7 @@ First refer to this URL(https://github.com/apache/celeborn)
to setup a celeborn
When compiling the Gluten Java module, it's required to enable `celeborn`
profile, as follows:
-```
+```bash
mvn clean package -Pbackends-clickhouse -Pspark-3.3 -Pceleborn -DskipTests
```
@@ -649,7 +649,7 @@ Then add the Spark Celeborn Client packages to your Spark
application's classpat
Currently to use Gluten following configurations are required in
`spark-defaults.conf`
-```
+```shell
spark.shuffle.manager
org.apache.spark.shuffle.gluten.celeborn.CelebornShuffleManager
# celeborn master
diff --git a/docs/get-started/Velox.md b/docs/get-started/Velox.md
index 7b88608..1e4c14b 100644
--- a/docs/get-started/Velox.md
+++ b/docs/get-started/Velox.md
@@ -53,7 +53,7 @@ git clone https://github.com/apache/incubator-gluten.git
# Build Gluten with Velox Backend
It's recommended to use buildbundle-veloxbe.sh to build gluten in one script.
-[Gluten build guide](./build-guide.md) listed the parameters and their default
value of build command for your reference.
+[Gluten build guide](build-guide.md) listed the parameters and their default
value of build command for your reference.
**For x86_64 build**
@@ -135,7 +135,7 @@ HDFS uris (hdfs://host:port) will be extracted from a valid
hdfs file path to in
libhdfs3 need a configuration file and [example
here](https://github.com/apache/hawq/blob/e9d43144f7e947e071bba48871af9da354d177d0/src/backend/utils/misc/etc/hdfs-client.xml),
this file is a bit different from hdfs-site.xml and core-site.xml.
Download that example config file to local and do some needed modifications to
support HA or else, then set env variable like below to use it, or upload it to
HDFS to use, more details
[here](https://github.com/apache/hawq/blob/e9d43144f7e947e071bba48871af9da354d177d0/depends/libhdfs3/src/client/Hdfs.cpp#L171-L189).
-```
+```shell
// Spark local mode
export LIBHDFS3_CONF="/path/to/hdfs-client.xml"
@@ -151,14 +151,14 @@ One typical deployment on Spark/HDFS cluster is to enable
[short-circuit reading
By default libhdfs3 does not set the default hdfs domain socket path to
support HDFS short-circuit read. If this feature is required in HDFS setup,
users may need to setup the domain socket path correctly by patching the
libhdfs3 source code or by setting the correct config environment. In Gluten
the short-circuit domain socket path is set to "/var/lib/hadoop-hdfs/dn_socket"
in
[build_velox.sh](https://github.com/apache/incubator-gluten/blob/main/ep/build-velox/src/build_velox.sh)
So we [...]
-```
+```bash
sudo mkdir -p /var/lib/hadoop-hdfs/
sudo chown <sparkuser>:<sparkuser> /var/lib/hadoop-hdfs/
```
You also need to add configuration to the "hdfs-site.xml" as below:
-```
+```xml
<property>
<name>dfs.client.read.shortcircuit</name>
<value>true</value>
@@ -175,7 +175,7 @@ Here are two steps to enable kerberos.
- Make sure the hdfs-client.xml contains
-```
+```xml
<property>
<name>hadoop.security.authentication</name>
<value>kerberos</value>
@@ -184,7 +184,7 @@ Here are two steps to enable kerberos.
- Specify the environment variable
[KRB5CCNAME](https://github.com/apache/hawq/blob/e9d43144f7e947e071bba48871af9da354d177d0/depends/libhdfs3/src/client/FileSystem.cpp#L56)
and upload the kerberos ticket cache file
-```
+```shell
--conf spark.executorEnv.KRB5CCNAME=krb5cc_0000 --files /tmp/krb5cc_0000
```
@@ -195,7 +195,7 @@ The ticket cache file can be found by `klist`.
Velox supports ABFS with the open source [Azure SDK for
C++](https://github.com/Azure/azure-sdk-for-cpp) and Gluten uses the Velox ABFS
connector to connect with ABFS.
The build option for ABFS (enable_abfs) must be set to enable this feature as
listed below.
-```
+```bash
cd /path/to/gluten
./dev/buildbundle-veloxbe.sh --enable_abfs=ON
```
@@ -208,7 +208,7 @@ Please refer [Velox ABFS](VeloxABFS.md) part for more
detailed configurations.
Velox supports S3 with the open source [AWS C++
SDK](https://github.com/aws/aws-sdk-cpp) and Gluten uses Velox S3 connector to
connect with S3.
A new build option for S3(enable_s3) is added. Below command is used to enable
this feature
-```
+```bash
cd /path/to/gluten
./dev/buildbundle-veloxbe.sh --enable_s3=ON
```
@@ -225,7 +225,7 @@ First refer to this URL(https://github.com/apache/celeborn)
to setup a celeborn
When compiling the Gluten Java module, it's required to enable `celeborn`
profile, as follows:
-```
+```bash
mvn clean package -Pbackends-velox -Pspark-3.3 -Pceleborn -DskipTests
```
@@ -236,7 +236,7 @@ Then add the Gluten and Spark Celeborn Client packages to
your Spark application
Currently to use Gluten following configurations are required in
`spark-defaults.conf`
-```
+```shell
spark.shuffle.manager
org.apache.spark.shuffle.gluten.celeborn.CelebornShuffleManager
# celeborn master
@@ -274,7 +274,7 @@ First refer to this
URL(https://uniffle.apache.org/docs/intro) to get start with
When compiling the Gluten Java module, it's required to enable `uniffle`
profile, as follows:
-```
+```bash
mvn clean package -Pbackends-velox -Pspark-3.3 -Puniffle -DskipTests
```
@@ -285,7 +285,7 @@ Then add the Uniffle and Spark Celeborn Client packages to
your Spark applicatio
Currently to use Gluten following configurations are required in
`spark-defaults.conf`
-```
+```shell
spark.shuffle.manager
org.apache.spark.shuffle.gluten.uniffle.UniffleShuffleManager
# uniffle coordinator address
spark.rss.coordinator.quorum ip:port
@@ -308,7 +308,7 @@ Gluten with velox backend supports
[DeltaLake](https://delta.io/) table.
First of all, compile gluten-delta module by a `delta` profile, as follows:
-```
+```bash
mvn clean package -Pbackends-velox -Pspark-3.3 -Pdelta -DskipTests
```
@@ -328,7 +328,7 @@ Gluten with velox backend supports
[Iceberg](https://iceberg.apache.org/) table.
First of all, compile gluten-iceberg module by a `iceberg` profile, as follows:
-```
+```bash
mvn clean package -Pbackends-velox -Pspark-3.3 -Piceberg -DskipTests
```
@@ -344,7 +344,7 @@ Spark3.3 has 387 functions in total. ~240 are commonly
used. To get the support
To identify what can be offloaded in a query and detailed fallback reasons,
user can follow below steps to retrieve corresponding logs.
-```
+```shell
1) Enable Gluten by proper
[configuration](https://github.com/apache/incubator-gluten/blob/main/docs/Configuration.md).
2) Disable Spark AQE to trigger plan validation in Gluten
diff --git a/docs/get-started/VeloxABFS.md b/docs/get-started/VeloxABFS.md
index 1aa4ec8..f961129 100644
--- a/docs/get-started/VeloxABFS.md
+++ b/docs/get-started/VeloxABFS.md
@@ -14,7 +14,7 @@ ABFS is an important data store for big data users. This doc
discusses config de
To configure access to your storage account, replace <storage-account> with
the name of your account. This property aligns with Spark configurations. By
setting this config multiple times using different storage account names, you
can access multiple ABFS accounts.
-```sh
+```shell
spark.hadoop.fs.azure.account.key.<storage-account>.dfs.core.windows.net
XXXXXXXXX
```
diff --git a/docs/get-started/VeloxGCS.md b/docs/get-started/VeloxGCS.md
index 3fd362d..fa519b0 100644
--- a/docs/get-started/VeloxGCS.md
+++ b/docs/get-started/VeloxGCS.md
@@ -28,7 +28,7 @@ This is described in the [instructions to configure a service
account]https://cl
Such json file with the credetials can be passed to Gluten:
-```sh
+```shell
spark.hadoop.fs.gs.auth.type
SERVICE_ACCOUNT_JSON_KEYFILE
spark.hadoop.fs.gs.auth.service.account.json.keyfile // path to the json file
with the credentials.
```
@@ -36,20 +36,20 @@ spark.hadoop.fs.gs.auth.service.account.json.keyfile //
path to the json file wi
## Configuring GCS endpoints
For cases when a GCS mock is used, an optional endpoint can be provided:
-```sh
+```shell
spark.hadoop.fs.gs.storage.root.url // url to the mock gcs service including
starting with http or https
```
## Configuring GCS max retry count
For cases when a transient server error is detected, GCS can be configured to
keep retrying until a number of transient error is detected.
-```sh
+```shell
spark.hadoop.fs.gs.http.max.retry // number of times to keep retrying unless a
non-transient error is detected
```
## Configuring GCS max retry time
For cases when a transient server error is detected, GCS can be configured to
keep retrying until the retry loop exceeds a prescribed duration.
-```sh
+```shell
spark.hadoop.fs.gs.http.max.retry-time // a string representing the time keep
retring (10s, 1m, etc).
```
diff --git a/docs/get-started/VeloxLocalCache.md
b/docs/get-started/VeloxLocalCache.md
index ead04ed..b67a1ab 100644
--- a/docs/get-started/VeloxLocalCache.md
+++ b/docs/get-started/VeloxLocalCache.md
@@ -9,7 +9,7 @@ permalink: /docs/getting-started/localcache/
Velox supports a local cache when reading data from HDFS/S3/ABFS. With this
feature, Velox can asynchronously cache the data on local disk when reading
from remote storage and future read requests on previously cached blocks will
be serviced from local cache files. To enable the local caching feature, the
following configurations are required:
-```
+```shell
spark.gluten.sql.columnar.backend.velox.cacheEnabled // enable or disable
velox cache, default false.
spark.gluten.sql.columnar.backend.velox.memCacheSize // the total size of
in-mem cache, default is 128MB.
spark.gluten.sql.columnar.backend.velox.ssdCachePath // the folder to
store the cache files, default is "/tmp".
diff --git a/docs/velox-backend/velox-backend-limitations.md
b/docs/velox-backend/velox-backend-limitations.md
index eb44c91..ff329f8 100644
--- a/docs/velox-backend/velox-backend-limitations.md
+++ b/docs/velox-backend/velox-backend-limitations.md
@@ -1,11 +1,13 @@
---
layout: page
title: Limitations
-nav_order: 1
+nav_order: 2
parent: Velox Backend
grand_parent: Documentations
permalink: /docs/velox-backend/limitations/
---
+
+# Velox Backend Limitations
This document describes the limitations of velox backend by listing some known
cases where exception will be thrown, gluten behaves incompatibly with spark,
or certain plan's execution
must fall back to vanilla spark, etc.
diff --git a/docs/velox-backend/velox-backend-support-progress.md
b/docs/velox-backend/velox-backend-support-progress.md
index 36dd75f..e76c755 100644
--- a/docs/velox-backend/velox-backend-support-progress.md
+++ b/docs/velox-backend/velox-backend-support-progress.md
@@ -1,7 +1,7 @@
---
layout: page
title: Supported Operators & Functions
-nav_order: 2
+nav_order: 1
parent: Velox Backend
grand_parent: Documentations
permalink: /docs/velox-backend/support/
diff --git a/docs/velox-backend/velox-backend-troubleshooting.md
b/docs/velox-backend/velox-backend-troubleshooting.md
index 440e0ce..7517761 100644
--- a/docs/velox-backend/velox-backend-troubleshooting.md
+++ b/docs/velox-backend/velox-backend-troubleshooting.md
@@ -6,7 +6,7 @@ parent: Velox Backend
grand_parent: Documentations
permalink: /docs/velox-backend/troubleshooting/
---
-## Troubleshooting
+# Velox Backend Troubleshooting
### Fatal error after native exception is thrown
diff --git a/docs/velox-backend/velox-backend-udf.md
b/docs/velox-backend/velox-backend-udf.md
new file mode 100644
index 0000000..750202f
--- /dev/null
+++ b/docs/velox-backend/velox-backend-udf.md
@@ -0,0 +1,239 @@
+---
+layout: page
+title: Velox UDF
+nav_order: 4
+parent: Velox Backend
+grand_parent: Documentations
+permalink: /docs/velox-backend/udf/
+---
+# Velox User-Defined Functions (UDF) and User-Defined Aggregate Functions
(UDAF)
+
+## Introduction
+
+Velox backend supports User-Defined Functions (UDF) and User-Defined Aggregate Functions (UDAF).
+Users can create their own functions using the UDF interface provided in the Velox backend and build libraries for these functions.
+At runtime, the UDFs are registered when the application starts.
+Once registered, Gluten can parse and offload these UDFs to Velox during execution.
+
+## Create and Build UDF/UDAF library
+
+The following steps demonstrate how to set up a UDF library project:
+
+- **Include the UDF Interface Header:**
+  First, include the UDF interface header file [Udf.h](../../cpp/velox/udf/Udf.h) in your project.
+ The header file defines the `UdfEntry` struct, along with the macros for
declaring the necessary functions to integrate the UDF into Gluten and Velox.
+
+- **Implement the UDF:**
+  Implement the UDF. These functions should be registrable with Velox.
+
+- **Implement the Interface Functions:**
+ Implement the following interface functions that integrate UDF into Project
Gluten:
+
+ - `getNumUdf()`:
+    This function should return the number of UDFs in the library.
+    It is used to allocate the `udfEntries` array passed as the argument to the next function, `getUdfEntries`.
+
+ - `getUdfEntries(gluten::UdfEntry* udfEntries)`:
+ This function should populate the provided udfEntries array with the
details of the UDF, including function names and signatures.
+
+ - `registerUdf()`:
+    This function is called to register the UDF with the Velox function registry.
+    This is where users should register functions by calling `facebook::velox::exec::registerVectorFunction` or other Velox APIs.
+
+  - The interface functions are mapped to macros in [Udf.h](../../cpp/velox/udf/Udf.h). Here's an example of how to implement these functions:
+
+ ```
+ // Filename MyUDF.cc
+
+ #include <velox/expression/VectorFunction.h>
+ #include <velox/udf/Udf.h>
+
+ namespace {
+ static const char* kInteger = "integer";
+ }
+
+ const int kNumMyUdf = 1;
+
+  const char* myUdfArgs[] = {kInteger};
+ gluten::UdfEntry myUdfSig = {"myudf", kInteger, 1, myUdfArgs};
+
+ class MyUdf : public facebook::velox::exec::VectorFunction {
+ ... // Omit concrete implementation
+  };
+
+  static std::vector<std::shared_ptr<facebook::velox::exec::FunctionSignature>>
+ myUdfSignatures() {
+ return {facebook::velox::exec::FunctionSignatureBuilder()
+ .returnType(myUdfSig.dataType)
+ .argumentType(myUdfSig.argTypes[0])
+ .build()};
+ }
+
+ DEFINE_GET_NUM_UDF { return kNumMyUdf; }
+
+ DEFINE_GET_UDF_ENTRIES { udfEntries[0] = myUdfSig; }
+
+ DEFINE_REGISTER_UDF {
+ facebook::velox::exec::registerVectorFunction(
+        myUdfSig.name, myUdfSignatures(), std::make_unique<MyUdf>());
+ }
+
+ ```
+
+To build the UDF library, users need to compile the C++ code and link to
`libvelox.so`.
+It's recommended to create a CMakeLists.txt for the project. Here's an example:
+
+```
+project(myudf)
+
+set(CMAKE_CXX_STANDARD 17)
+set(CMAKE_CXX_STANDARD_REQUIRED ON)
+
+set(GLUTEN_HOME /path/to/gluten)
+
+add_library(myudf SHARED "MyUDF.cc")
+
+find_library(VELOX_LIBRARY REQUIRED NAMES velox HINTS
${GLUTEN_HOME}/cpp/build/releases NO_DEFAULT_PATH)
+
+target_include_directories(myudf PRIVATE ${GLUTEN_HOME}/cpp
${GLUTEN_HOME}/ep/build-velox/build/velox_ep)
+target_link_libraries(myudf PRIVATE ${VELOX_LIBRARY})
+```
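+
+As a minimal sketch, the library could then be built with the standard CMake workflow, assuming `CMakeLists.txt` and `MyUDF.cc` sit in the same directory and Gluten's cpp build has already produced `libvelox.so`:
+
+```shell
+# Configure the project into ./build, then compile, producing libmyudf.so
+cmake -S . -B build
+cmake --build build
+```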
+
+The steps for creating and building a UDAF library are quite similar to those
for a UDF library.
+The major difference is including the UDAF header file [Udaf.h](../../cpp/velox/udf/Udaf.h) and defining the following interface functions:
+
+- `getNumUdaf()`
+- `getUdafEntries(gluten::UdafEntry* udafEntries)`
+- `registerUdaf()`
+
+`gluten::UdafEntry` requires an additional field `intermediateType`, to
specify the output type from partial aggregation.
+For a detailed implementation, refer to the example code in [MyUDAF.cc](../../cpp/velox/udf/examples/MyUDAF.cc).
+
+## Using UDF/UDAF in Gluten
+
+Gluten loads the UDF libraries at runtime. You can upload UDF libraries via `--files` or `--archives`, and configure the library paths using the provided Spark configuration, which accepts a comma-separated list of library paths.
+
+Note that in Yarn client mode, the uploaded files are not reachable on the driver side. Users should copy those files to a location reachable by the driver and set `spark.gluten.sql.columnar.backend.velox.driver.udfLibraryPaths`.
+This configuration is also useful when `udfLibraryPaths` differs between the driver side and the executor side.
+
+- Use the `--files` option to upload a library and configure its relative path
+
+```shell
+--files /path/to/gluten/cpp/build/velox/udf/examples/libmyudf.so
+--conf spark.gluten.sql.columnar.backend.velox.udfLibraryPaths=libmyudf.so
+# Needed for Yarn client mode
+--conf
spark.gluten.sql.columnar.backend.velox.driver.udfLibraryPaths=file:///path/to/gluten/cpp/build/velox/udf/examples/libmyudf.so
+```
+
+- Use the `--archives` option to upload an archive and configure its relative
path
+
+```shell
+--archives /path/to/udf_archives.zip#udf_archives
+--conf spark.gluten.sql.columnar.backend.velox.udfLibraryPaths=udf_archives
+# Needed for Yarn client mode
+--conf
spark.gluten.sql.columnar.backend.velox.driver.udfLibraryPaths=file:///path/to/udf_archives.zip
+```
+
+- Configure URI
+
+You can also specify local or HDFS URIs to the UDF libraries or archives. Local URIs must exist on the driver and on every worker node.
+
+```shell
+--conf
spark.gluten.sql.columnar.backend.velox.udfLibraryPaths=file:///path/to/library_or_archive
+```
+
+## Try the example
+
+We provide Velox UDF examples in [MyUDF.cc](../../cpp/velox/udf/examples/MyUDF.cc) and UDAF examples in [MyUDAF.cc](../../cpp/velox/udf/examples/MyUDAF.cc).
+You need to build the Gluten project with `--build_examples=ON` to get the example libraries.
+
+```shell
+./dev/buildbundle-veloxbe.sh --build_examples=ON
+```
+
+Then, you can find the example libraries at
/path/to/gluten/cpp/build/velox/udf/examples/
+
+Start spark-shell or spark-sql with the configuration below:
+
+```shell
+# Use the `--files` option to upload a library and configure its relative path
+--files /path/to/gluten/cpp/build/velox/udf/examples/libmyudf.so
+--conf spark.gluten.sql.columnar.backend.velox.udfLibraryPaths=libmyudf.so
+```
+
+or
+
+```shell
+# Only configure URI
+--conf
spark.gluten.sql.columnar.backend.velox.udfLibraryPaths=file:///path/to/gluten/cpp/build/velox/udf/examples/libmyudf.so
+```
+
+Run a query. The functions `myudf1` and `myudf2` increment the input value by a constant of 5.
+
+```
+select myudf1(100L), myudf2(1)
+```
+
+The output from spark-shell will look like:
+
+```
++------------------+----------------+
+|udfexpression(100)|udfexpression(1)|
++------------------+----------------+
+| 105| 6|
++------------------+----------------+
+```
+
+## Configurations
+
+| Parameter | Description |
+|-----------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------|
+| spark.gluten.sql.columnar.backend.velox.udfLibraryPaths | Paths to the UDF/UDAF libraries. |
+| spark.gluten.sql.columnar.backend.velox.driver.udfLibraryPaths | Paths to the UDF/UDAF libraries on the driver node. Only applicable in yarn-client mode. |
+| spark.gluten.sql.columnar.backend.velox.udfAllowTypeConversion | Whether to inject a possible `cast` to convert mismatched input data types to one of the registered signatures. |
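+
+As a minimal sketch (the configuration names come from the table above; the paths and values are illustrative), these options can be combined in a single spark-shell launch:
+
+```shell
+spark-shell \
+  --files /path/to/gluten/cpp/build/velox/udf/examples/libmyudf.so \
+  --conf spark.gluten.sql.columnar.backend.velox.udfLibraryPaths=libmyudf.so \
+  --conf spark.gluten.sql.columnar.backend.velox.udfAllowTypeConversion=true
+```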
+
+# Pandas UDFs (a.k.a. Vectorized UDFs)
+
+## Introduction
+
+Pandas UDFs are user-defined functions that are executed by Spark using Arrow to transfer data and Pandas to work with the data, which allows vectorized operations. A Pandas UDF is defined using `pandas_udf()` as a decorator or to wrap the function, and no additional configuration is required.
+In general, a Pandas UDF behaves like a regular PySpark function. For more details, you can refer to the [Spark documentation](https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html).
+
+## Using Pandas UDFs in Gluten with Velox Backend
+
+As in vanilla Spark, users need to set up the pyspark/arrow dependencies properly first. You can refer to the following steps:
+
+```shell
+pip3 install pyspark==$SPARK_VERSION cython
+pip3 install pandas pyarrow
+```
+
+Gluten provides a config to enable or disable `ColumnarArrowEvalPython`, with `true` as the default.
+
+```
+spark.gluten.sql.columnar.arrowUdf
+```
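+
+For example, to explicitly turn the columnar Arrow evaluation path off (a sketch; `true` is already the default as noted above), pass:
+
+```shell
+--conf spark.gluten.sql.columnar.arrowUdf=false
+```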
+
+Then take the following `PySpark` code as an example:
+
+```
+from pyspark.sql.functions import pandas_udf, PandasUDFType
+import pyspark.sql.functions as F
+import os
+@pandas_udf('long')
+def pandas_plus_one(v):
+ return (v + 1)
+df = spark.read.orc("path_to_file").select("quantity").withColumn("processed_quantity", pandas_plus_one("quantity")).select("processed_quantity")
+```
+
+The expected physical plan will be:
+
+```
+== Physical Plan ==
+VeloxColumnarToRowExec
++- ^(2) ProjectExecTransformer [pythonUDF0#45L AS processed_quantity#41L]
+ +- ^(2) InputIteratorTransformer[quantity#2L, pythonUDF0#45L]
+ +- ^(2) InputAdapter
+ +- ^(2) ColumnarArrowEvalPython [pandas_plus_one(quantity#2L)#40L],
[pythonUDF0#45L], 200
+ +- ^(1) NativeFileScan orc [quantity#2L] Batched: true,
DataFilters: [], Format: ORC, Location: InMemoryFileIndex(1 paths)[file:/***],
PartitionFilters: [], PushedFilters: [], ReadSchema: struct<quantity:bigint>
+```
diff --git a/images/apache-incubator.svg b/images/apache-incubator.svg
deleted file mode 100644
index dc152b4..0000000
--- a/images/apache-incubator.svg
+++ /dev/null
@@ -1 +0,0 @@
-{"payload":{"allShortcutsEnabled":false,"fileTree":{"static/img":{"items":[{"name":"benchmarks","path":"static/img/benchmarks","contentType":"directory"},{"name":"apache-incubator.svg","path":"static/img/apache-incubator.svg","contentType":"file"},{"name":"case1.png","path":"static/img/case1.png","contentType":"file"},{"name":"case2.png","path":"static/img/case2.png","contentType":"file"},{"name":"favicon.ico","path":"static/img/favicon.ico","contentType":"file"},{"name":"fury_banner.png
[...]
\ No newline at end of file
diff --git a/images/gluten-logo-blue.png b/images/gluten-logo-blue.png
deleted file mode 100644
index 801fd9b..0000000
Binary files a/images/gluten-logo-blue.png and /dev/null differ
diff --git a/index.md b/index.md
index 9cbc8a7..fddc91f 100644
--- a/index.md
+++ b/index.md
@@ -36,15 +36,12 @@ The basic rule of Gluten's design is that we would reuse
spark's whole control f
## 1.3 Target User
Gluten's target user is anyone who wants to fundamentally accelerate SparkSQL. As a plugin to Spark, Gluten doesn't require any changes to the DataFrame API or SQL queries; it only requires users to set the correct configuration.
-See Gluten configuration properties
[here](https://github.com/apache/incubator-gluten-site/blob/main/docs/VeloxBuildGuide.md).
+See Gluten configuration properties
[here](https://gluten.apache.org/docs/configuration/).
## 1.4 References
You can click below links for more related information.
-- [Gluten Intro Video at Data AI Summit
2022](https://www.youtube.com/watch?v=0Q6gHT_N-1U)
-- [Gluten Intro Article at
Medium.com](https://medium.com/intel-analytics-software/accelerate-spark-sql-queries-with-gluten-9000b65d1b4e)
-- [Gluten Intro Article at Kyligence.io(in
Chinese)](https://cn.kyligence.io/blog/gluten-spark/)
-- [Velox Intro from
Meta](https://engineering.fb.com/2023/03/09/open-source/velox-open-source-execution-engine/)
+- [Gluten References](https://gluten.apache.org/references/)
# 2 Architecture
diff --git a/references.md b/references.md
new file mode 100644
index 0000000..72f6837
--- /dev/null
+++ b/references.md
@@ -0,0 +1,32 @@
+---
+layout: page
+title: Gluten References
+nav_order: 6
+permalink: /references/
+---
+
+# Gluten related Use Cases and Publications
+
+For more information on Apache Gluten (incubating), including related articles
and videos, please refer to the following links.
+
+| Title | Company | Publish Date | Reference |
+|-------------------|---------|--------------|-----------------|
+|Gluten: A middle layer to offload Spark SQL to Native|Intel|Jul.
2022|[Link](https://www.youtube.com/watch?v=0Q6gHT_N-1U&ab_channel=Databricks)|
+|Best Exploration of Columnar Shuffle Design|Intel|Jul.
2022|[Link](https://www.youtube.com/watch?v=RICMojO0j1A&ab_channel=Databricks)|
+|Gluten Introduction|Intel|Sept.
2022|[Link](https://medium.com/intel-analytics-software/accelerate-spark-sql-queries-with-gluten-9000b65d1b4e)|
+|Introducing Velox|Meta|Mar.
2023|[Link](https://engineering.fb.com/2023/03/09/open-source/velox-open-source-execution-engine/)|
+|Gluten in Kyligence(in Chinese)|Kyligence|Mar.
2023|[Link](https://zhuanlan.zhihu.com/p/617944074)|
+|Velox in Intel|Intel|Jun.
2023|[Link](https://www.youtube.com/watch?v=yZ8F1vWqFXw&ab_channel=PrestoFoundation)|
+|Gluten + Celeborn(in Chinese)|Intel|Jul.
2023|[Link](https://blog.csdn.net/weixin_45906054/article/details/131651065)|
+|Gluten: Modernizing Java-based Query Engines|Intel|Sept.
2023|[Link](https://ceur-ws.org/Vol-3462/CDMS8.pdf)|
+|Apache Spark with Native Engine|BigO|Nov.
2023|[Link](https://blog.csdn.net/m0_70952941/article/details/134396816)|
+|BONC's BEH with Gluten and AVX512|BONC|Nov.
2023|[Link](https://www.intel.cn/content/www/cn/zh/artificial-intelligence/analytics/bonc-big-data-solutions-optimized-avx512-and-qat.html)|
+|Gluten: Double Spark Performance|Intel|Nov.
2023|[Link](https://www.slidestalk.com/slidestalk/71777?video)|
+|Apache Spark Native Engine(in Chinese)|Netease|Nov.
2023|[Link](https://zhuanlan.zhihu.com/p/670297787)|
+|Apache Spark Native Engine|Netease|Dec.
2023|[Link](https://medium.com/@KyuubiApache/apache-spark-native-engine-3e1060567ed0)|
+|Apache Gluten Status Update|Intel|Apr.
2024|[Link](https://www.youtube.com/watch?v=H7L5W6Vio3U&list=PLJvBe8nQAEsEBSoUY0lRFVZr2_YeHYkUR&index=8)|
+|Accelerating Spark at Microsoft using Gluten and Velox|Microsoft|Apr. 2024|[Link](https://www.youtube.com/watch?v=7pXOAjSITYs&ab_channel=VeloxCon)|
+|Velox at IBM|IBM|Apr.
2024|[Link](https://youtu.be/npoEudB5nPo?si=hToh-acObN3miM1Q)|
+|Unlocking Data Query Performance|Pinterest|Apr.
2024|[Link](https://www.youtube.com/watch?v=pQ4bMyXXLss&list=PLJvBe8nQAEsEBSoUY0lRFVZr2_YeHYkUR&index=10&t=3s&ab_channel=VeloxCon)|
+|Native execution engine for Fabric Spark|Microsoft|May
2024|[Link](https://learn.microsoft.com/en-us/fabric/data-engineering/native-execution-engine-overview?tabs=sparksql)|
+|Best Practice of Gluten and Velox in Meituan|Meituan|Jun. 2024|[Link](https://mp.weixin.qq.com/s/VvmhQi8YMsm0P5xYoiGEZQ)|
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]