This is an automated email from the ASF dual-hosted git repository.
dnskr pushed a commit to branch branch-1.10
in repository https://gitbox.apache.org/repos/asf/kyuubi.git
The following commit(s) were added to refs/heads/branch-1.10 by this push:
new 4bb0d1839d [KYUUBI #7222] [DOC] Fix "document isn't included in any toctree" warnings
4bb0d1839d is described below
commit 4bb0d1839d3715e64af6b8a9ddf5cc9ecfd45222
Author: dnskr <[email protected]>
AuthorDate: Wed Oct 15 23:25:02 2025 +0200
[KYUUBI #7222] [DOC] Fix "document isn't included in any toctree" warnings
### Why are the changes needed?
The PR fixes a few `document isn't included in any toctree` issues:
* Page `JVM Quake Support` is now included in the `Extensions for Spark` menu:
```shell
./kyuubi/docs/extensions/engines/spark/jvm-quake.md: WARNING: document isn't included in any toctree
```
* Page `Z-order Benchmark` has been consolidated into the `Z-Ordering Support` page:
```shell
./kyuubi/docs/extensions/engines/spark/z-order-benchmark.md: WARNING: document isn't included in any toctree
```
* File `requirements.txt` is now listed in `exclude_patterns`:
```shell
./kyuubi/docs/requirements.txt: WARNING: document isn't included in any toctree
```
### How was this patch tested?
Checked that the warnings mentioned above no longer appear during the documentation build:
```shell
make html
```
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #7222 from dnskr/doc-fix-document-isn't-included-in-any-toctree.
Closes #7222
cf7fb1f48 [dnskr] [DOC] Fix "document isn't included in any toctree" warnings
Authored-by: dnskr <[email protected]>
Signed-off-by: dnskr <[email protected]>
(cherry picked from commit 54212eff8b6cbb6654d572c910e1fc6209e6a8ad)
Signed-off-by: dnskr <[email protected]>
---
docs/conf.py | 2 +-
docs/extensions/engines/spark/index.rst | 1 +
docs/extensions/engines/spark/z-order-benchmark.md | 230 ---------------------
docs/extensions/engines/spark/z-order.md | 215 ++++++++++++++++++-
4 files changed, 214 insertions(+), 234 deletions(-)
diff --git a/docs/conf.py b/docs/conf.py
index 6ec273a072..9205851497 100644
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -92,7 +92,7 @@ templates_path = ['_templates']
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
-exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
+exclude_patterns = ['requirements.txt', '_build', 'Thumbs.db', '.DS_Store']
# -- Options for HTML output -------------------------------------------------
diff --git a/docs/extensions/engines/spark/index.rst b/docs/extensions/engines/spark/index.rst
index b8eea3045c..9b157f02d2 100644
--- a/docs/extensions/engines/spark/index.rst
+++ b/docs/extensions/engines/spark/index.rst
@@ -26,3 +26,4 @@ Extensions for Spark
../../../connector/spark/index
lineage
jdbc-dialect
+ jvm-quake
diff --git a/docs/extensions/engines/spark/z-order-benchmark.md b/docs/extensions/engines/spark/z-order-benchmark.md
deleted file mode 100644
index 293042530d..0000000000
--- a/docs/extensions/engines/spark/z-order-benchmark.md
+++ /dev/null
@@ -1,230 +0,0 @@
-<!--
-- Licensed to the Apache Software Foundation (ASF) under one or more
-- contributor license agreements. See the NOTICE file distributed with
-- this work for additional information regarding copyright ownership.
-- The ASF licenses this file to You under the Apache License, Version 2.0
-- (the "License"); you may not use this file except in compliance with
-- the License. You may obtain a copy of the License at
--
-- http://www.apache.org/licenses/LICENSE-2.0
--
-- Unless required by applicable law or agreed to in writing, software
-- distributed under the License is distributed on an "AS IS" BASIS,
-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-- See the License for the specific language governing permissions and
-- limitations under the License.
--->
-
-# Z-order Benchmark
-
-Z-order is a technique that allows you to map multidimensional data to a single dimension. We ran a performance test to measure its impact.
-
-For this test, we used the aliyun Databricks Delta test case:
-https://help.aliyun.com/document_detail/168137.html?spm=a2c4g.11186623.6.563.10d758ccclYtVb
-
-Prepare data for the three scenarios:
-
-1. 10 billion rows in 200 parquet files: big files (~1 GB each)
-2. 10 billion rows in 1,000 parquet files: medium files (~200 MB each)
-3. 1 billion rows in 10,000 parquet files: small files (~200 KB each)
-
-Test environment:
-
-- spark-3.1.2
-- hadoop-2.7.2
-- kyuubi-1.4.0
-
-Test steps:
-
-Step 1: Create the Hive tables.
-
-```scala
-spark.sql(s"drop database if exists $dbName cascade")
-spark.sql(s"create database if not exists $dbName")
-spark.sql(s"use $dbName")
-spark.sql(s"create table $connRandomParquet (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
-spark.sql(s"create table $connOrderbyOnlyIp (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
-spark.sql(s"create table $connOrderby (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
-spark.sql(s"create table $connZorderOnlyIp (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
-spark.sql(s"create table $connZorder (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
-spark.sql(s"show tables").show(false)
-```
-
-Step 2: Prepare data for the parquet tables in the three scenarios, using the following code.
-
-```scala
-def randomIPv4(r: Random) = Seq.fill(4)(r.nextInt(256)).mkString(".")
-def randomPort(r: Random) = r.nextInt(65536)
-
-def randomConnRecord(r: Random) = ConnRecord(
- src_ip = randomIPv4(r), src_port = randomPort(r),
- dst_ip = randomIPv4(r), dst_port = randomPort(r))
-```
-
-Step 3: Optimize with z-order on the IP columns only, and compare with ORDER BY on the IP columns only (sort columns: src_ip, dst_ip), with the number of shuffle partitions set to the number of files.
-
-```sql
-INSERT OVERWRITE TABLE conn_order_only_ip SELECT src_ip, src_port, dst_ip, dst_port FROM conn_random_parquet ORDER BY src_ip, dst_ip;
-OPTIMIZE conn_zorder_only_ip ZORDER BY src_ip, dst_ip;
-```
-
-Step 4: Optimize with z-order on all columns, and compare with ORDER BY on all columns (sort columns: src_ip, src_port, dst_ip, dst_port), with the number of shuffle partitions set to the number of files.
-
-```sql
-INSERT OVERWRITE TABLE conn_order SELECT src_ip, src_port, dst_ip, dst_port FROM conn_random_parquet ORDER BY src_ip, src_port, dst_ip, dst_port;
-OPTIMIZE conn_zorder ZORDER BY src_ip, src_port, dst_ip, dst_port;
-```
-
-The complete code is as follows:
-
-```shell
-./spark-shell
-
-import org.apache.spark.SparkConf
-import org.apache.spark.sql.SparkSession
-
-case class ConnRecord(src_ip: String, src_port: Int, dst_ip: String, dst_port: Int)
-
-val conf = new SparkConf().setAppName("zorder_test")
-val spark = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate()
-import spark.implicits._
-
-val sc = spark.sparkContext
-sc.setLogLevel("WARN")
-
-// ten billion rows and two hundred files
-val numRecords = 10 * 1000 * 1000 * 1000L
-val numFiles = 200
-
-val dbName = s"zorder_test_$numFiles"
-val baseLocation = s"hdfs://localhost:9000/zorder_test/$dbName/"
-val connRandomParquet = "conn_random_parquet"
-val connOrderbyOnlyIp = "conn_order_only_ip"
-val connOrderby = "conn_order"
-val connZorderOnlyIp = "conn_zorder_only_ip"
-val connZorder = "conn_zorder"
-
-// one shuffle partition per target file
-spark.conf.set("spark.sql.shuffle.partitions", numFiles)
-spark.conf.get("spark.sql.shuffle.partitions")
-spark.conf.set("spark.sql.hive.convertMetastoreParquet", false)
-
-spark.sql(s"drop database if exists $dbName cascade")
-spark.sql(s"create database if not exists $dbName")
-spark.sql(s"use $dbName")
-spark.sql(s"create table $connRandomParquet (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
-spark.sql(s"create table $connOrderbyOnlyIp (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
-spark.sql(s"create table $connOrderby (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
-spark.sql(s"create table $connZorderOnlyIp (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
-spark.sql(s"create table $connZorder (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
-spark.sql(s"show tables").show(false)
-
-import scala.util.Random
-// Functions for preparing the zorder_test data
-def randomIPv4(r: Random) = Seq.fill(4)(r.nextInt(256)).mkString(".")
-def randomPort(r: Random) = r.nextInt(65536)
-
-def randomConnRecord(r: Random) = ConnRecord(
-  src_ip = randomIPv4(r), src_port = randomPort(r),
-  dst_ip = randomIPv4(r), dst_port = randomPort(r))
-
-// one partition per target file, each seeded with its own partition id
-val df = spark.range(0, numFiles, 1, numFiles).mapPartitions { it =>
-  val partitionID = it.toStream.head
-  val r = new Random(seed = partitionID)
-  Iterator.fill((numRecords / numFiles).toInt)(randomConnRecord(r))
-}
-
-df.write
-  .mode("overwrite")
-  .format("parquet")
-  .insertInto(connRandomParquet)
-
-spark.read.table(connRandomParquet)
-  .write
-  .mode("overwrite")
-  .format("parquet")
-  .insertInto(connZorderOnlyIp)
-
-spark.read.table(connRandomParquet)
-  .write
-  .mode("overwrite")
-  .format("parquet")
-  .insertInto(connZorder)
-
-spark.stop()
-```
-
-Z-order Optimize statement:
-
-```sql
-set spark.sql.hive.convertMetastoreParquet=false;
-
-OPTIMIZE conn_zorder_only_ip ZORDER BY src_ip, dst_ip;
-
-OPTIMIZE zorder_test.conn_zorder ZORDER BY src_ip, src_port, dst_ip, dst_port;
-```
-
-ORDER BY statement:
-
-```sql
-INSERT OVERWRITE TABLE conn_order_only_ip SELECT src_ip, src_port, dst_ip, dst_port FROM conn_random_parquet ORDER BY src_ip, dst_ip;
-
-INSERT OVERWRITE TABLE conn_order SELECT src_ip, src_port, dst_ip, dst_port FROM conn_random_parquet ORDER BY src_ip, src_port, dst_ip, dst_port;
-```
-
-Query statement:
-
-```sql
-set spark.sql.hive.convertMetastoreParquet=true;
-
-select count(*) from conn_random_parquet where src_ip like '157%' and dst_ip like '216.%';
-
-select count(*) from conn_zorder_only_ip where src_ip like '157%' and dst_ip like '216.%';
-
-select count(*) from conn_zorder where src_ip like '157%' and dst_ip like '216.%';
-```
-
-## Benchmark result
-
-We ran two performance tests: one compares the efficiency of Z-order optimize against an ORDER BY sort,
-and the other compares queries over Z-order optimized data against randomly ordered data.
-
-### Efficiency of Z-order Optimize and Order-by Sort
-
-**10 billion rows, 1,000 files; query resources: 200 cores, 600 GB memory**
-
-Z-order vs. ORDER BY on the IP columns only:
-
-| Table               | Row count      | Optimize time |
-|---------------------|----------------|---------------|
-| conn_order_only_ip  | 10,000,000,000 | 1591.99 s     |
-| conn_zorder_only_ip | 10,000,000,000 | 8371.405 s    |
-
-Z-order vs. ORDER BY on all columns:
-
-| Table       | Row count      | Optimize time |
-|-------------|----------------|---------------|
-| conn_order  | 10,000,000,000 | 1515.298 s    |
-| conn_zorder | 10,000,000,000 | 11057.194 s   |
-
-### Z-order by benchmark result
-
-By querying the tables before and after optimization, we find that:
-
-**10 billion rows, 200 files; query resources: 200 cores, 600 GB memory**
-
-| Table               | Average file size | Scan row count | Average query time | Row skipping ratio |
-|---------------------|-------------------|----------------|--------------------|--------------------|
-| conn_random_parquet | 1.2 G             | 10,000,000,000 | 27.554 s           | 0.0%               |
-| conn_zorder_only_ip | 890 M             | 43,170,600     | 2.459 s            | 99.568%            |
-| conn_zorder         | 890 M             | 54,841,302     | 3.185 s            | 99.451%            |
-
-**10 billion rows, 1,000 files; query resources: 200 cores, 600 GB memory**
-
-| Table               | Average file size | Scan row count | Average query time | Row skipping ratio |
-|---------------------|-------------------|----------------|--------------------|--------------------|
-| conn_random_parquet | 234.8 M           | 10,000,000,000 | 27.031 s           | 0.0%               |
-| conn_zorder_only_ip | 173.9 M           | 53,499,068     | 3.120 s            | 99.465%            |
-| conn_zorder         | 174.0 M           | 35,910,500     | 3.103 s            | 99.640%            |
-
-**1 billion rows, 10,000 files; query resources: 10 cores, 40 GB memory**
-
-| Table               | Average file size | Scan row count | Average query time | Row skipping ratio |
-|---------------------|-------------------|----------------|--------------------|--------------------|
-| conn_random_parquet | 2.7 M             | 1,000,000,000  | 76.772 s           | 0.0%               |
-| conn_zorder_only_ip | 2.1 M             | 406,572        | 3.963 s            | 99.959%            |
-| conn_zorder         | 2.2 M             | 387,942        | 3.621 s            | 99.961%            |
-
diff --git a/docs/extensions/engines/spark/z-order.md b/docs/extensions/engines/spark/z-order.md
index 80e85cdf4e..e38cf0e044 100644
--- a/docs/extensions/engines/spark/z-order.md
+++ b/docs/extensions/engines/spark/z-order.md
@@ -20,8 +20,6 @@
To improve query speed, Kyuubi supports Z-Ordering to optimize the layout of data
stored in all kinds of storage with various data formats.
-Please check our benchmark report [here](z-order-benchmark.md).
-
## Introduction
The following picture shows the workflow of z-order.
@@ -29,9 +27,10 @@ The following picture shows the workflow of z-order.

It contains three parts:
+
- Upstream
-Due to the extra sort, the upstream job will run a little slower than before
+ Due to the extra sort, the upstream job will run a little slower than before
- Table
@@ -121,3 +120,213 @@ INSERT OVERWRITE TABLE t1 PARTITION() ...;
CREATE TABLE t1 AS SELECT ...;
```
+## Benchmark
+
+For this test, we used the aliyun Databricks Delta test case:
+https://help.aliyun.com/document_detail/168137.html?spm=a2c4g.11186623.6.563.10d758ccclYtVb
+
+### Overview
+
+Prepare data for the three scenarios:
+
+1. 10 billion rows in 200 parquet files: big files (~1 GB each)
+2. 10 billion rows in 1,000 parquet files: medium files (~200 MB each)
+3. 1 billion rows in 10,000 parquet files: small files (~200 KB each)
+
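For a rough sense of scale, the per-file row counts implied by these scenarios can be computed directly (an illustrative editorial sketch, not part of the original benchmark):

```scala
// Rows per file implied by each scenario; scenario 1's ~50M rows of
// (ip, port, ip, port) records is what lands near the ~1 GB parquet target.
val scenarios = Seq(
  ("big",    10L * 1000 * 1000 * 1000, 200),
  ("medium", 10L * 1000 * 1000 * 1000, 1000),
  ("small",  1L * 1000 * 1000 * 1000,  10000))

scenarios.foreach { case (name, rows, files) =>
  println(f"$name%-6s ${rows / files}%,d rows per file")
}
```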
+Test environment:
+
+- spark-3.1.2
+- hadoop-2.7.2
+- kyuubi-1.4.0
+
+Test steps:
+
+Step 1: Create the Hive tables.
+
+```scala
+spark.sql(s"drop database if exists $dbName cascade")
+spark.sql(s"create database if not exists $dbName")
+spark.sql(s"use $dbName")
+spark.sql(s"create table $connRandomParquet (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
+spark.sql(s"create table $connOrderbyOnlyIp (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
+spark.sql(s"create table $connOrderby (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
+spark.sql(s"create table $connZorderOnlyIp (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
+spark.sql(s"create table $connZorder (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
+spark.sql(s"show tables").show(false)
+```
+
+Step 2: Prepare data for the parquet tables in the three scenarios, using the following code.
+
+```scala
+def randomIPv4(r: Random) = Seq.fill(4)(r.nextInt(256)).mkString(".")
+def randomPort(r: Random) = r.nextInt(65536)
+
+def randomConnRecord(r: Random) = ConnRecord(
+ src_ip = randomIPv4(r), src_port = randomPort(r),
+ dst_ip = randomIPv4(r), dst_port = randomPort(r))
+```
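To sanity-check these generators in the same spark-shell session before producing billions of rows, one can draw a single record (an illustrative snippet, not from the original doc):

```scala
import scala.util.Random

// Seeded, so the sample is reproducible across runs.
val r = new Random(seed = 42)
randomConnRecord(r)  // => ConnRecord(<random IPv4>, <0-65535>, <random IPv4>, <0-65535>)
```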
+
+Step 3: Optimize with z-order on the IP columns only, and compare with ORDER BY on the IP columns only (sort columns: src_ip, dst_ip), with the number of shuffle partitions set to the number of files.
+
+```sql
+INSERT OVERWRITE TABLE conn_order_only_ip SELECT src_ip, src_port, dst_ip, dst_port FROM conn_random_parquet ORDER BY src_ip, dst_ip;
+OPTIMIZE conn_zorder_only_ip ZORDER BY src_ip, dst_ip;
+```
+
+Step 4: Optimize with z-order on all columns, and compare with ORDER BY on all columns (sort columns: src_ip, src_port, dst_ip, dst_port), with the number of shuffle partitions set to the number of files.
+
+```sql
+INSERT OVERWRITE TABLE conn_order SELECT src_ip, src_port, dst_ip, dst_port FROM conn_random_parquet ORDER BY src_ip, src_port, dst_ip, dst_port;
+OPTIMIZE conn_zorder ZORDER BY src_ip, src_port, dst_ip, dst_port;
+```
+
+The complete code is as follows:
+
+```shell
+./spark-shell
+
+import org.apache.spark.SparkConf
+import org.apache.spark.sql.SparkSession
+
+case class ConnRecord(src_ip: String, src_port: Int, dst_ip: String, dst_port: Int)
+
+val conf = new SparkConf().setAppName("zorder_test")
+val spark = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate()
+import spark.implicits._
+
+val sc = spark.sparkContext
+sc.setLogLevel("WARN")
+
+// ten billion rows and two hundred files
+val numRecords = 10 * 1000 * 1000 * 1000L
+val numFiles = 200
+
+val dbName = s"zorder_test_$numFiles"
+val baseLocation = s"hdfs://localhost:9000/zorder_test/$dbName/"
+val connRandomParquet = "conn_random_parquet"
+val connOrderbyOnlyIp = "conn_order_only_ip"
+val connOrderby = "conn_order"
+val connZorderOnlyIp = "conn_zorder_only_ip"
+val connZorder = "conn_zorder"
+
+// one shuffle partition per target file
+spark.conf.set("spark.sql.shuffle.partitions", numFiles)
+spark.conf.get("spark.sql.shuffle.partitions")
+spark.conf.set("spark.sql.hive.convertMetastoreParquet", false)
+
+spark.sql(s"drop database if exists $dbName cascade")
+spark.sql(s"create database if not exists $dbName")
+spark.sql(s"use $dbName")
+spark.sql(s"create table $connRandomParquet (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
+spark.sql(s"create table $connOrderbyOnlyIp (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
+spark.sql(s"create table $connOrderby (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
+spark.sql(s"create table $connZorderOnlyIp (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
+spark.sql(s"create table $connZorder (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
+spark.sql(s"show tables").show(false)
+
+import scala.util.Random
+// Functions for preparing the zorder_test data
+def randomIPv4(r: Random) = Seq.fill(4)(r.nextInt(256)).mkString(".")
+def randomPort(r: Random) = r.nextInt(65536)
+
+def randomConnRecord(r: Random) = ConnRecord(
+  src_ip = randomIPv4(r), src_port = randomPort(r),
+  dst_ip = randomIPv4(r), dst_port = randomPort(r))
+
+// one partition per target file, each seeded with its own partition id
+val df = spark.range(0, numFiles, 1, numFiles).mapPartitions { it =>
+  val partitionID = it.toStream.head
+  val r = new Random(seed = partitionID)
+  Iterator.fill((numRecords / numFiles).toInt)(randomConnRecord(r))
+}
+
+df.write
+  .mode("overwrite")
+  .format("parquet")
+  .insertInto(connRandomParquet)
+
+spark.read.table(connRandomParquet)
+  .write
+  .mode("overwrite")
+  .format("parquet")
+  .insertInto(connZorderOnlyIp)
+
+spark.read.table(connRandomParquet)
+  .write
+  .mode("overwrite")
+  .format("parquet")
+  .insertInto(connZorder)
+
+spark.stop()
+```
+
+Z-order Optimize statement:
+
+```sql
+set spark.sql.hive.convertMetastoreParquet=false;
+
+OPTIMIZE conn_zorder_only_ip ZORDER BY src_ip, dst_ip;
+
+OPTIMIZE zorder_test.conn_zorder ZORDER BY src_ip, src_port, dst_ip, dst_port;
+```
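Note that `OPTIMIZE ... ZORDER BY` is parsed by the Kyuubi Spark SQL extension rather than by vanilla Spark. A minimal sketch of enabling it when building the session (assuming the kyuubi-extension-spark jar is on the classpath):

```scala
import org.apache.spark.sql.SparkSession

// Without the extension, Spark's parser rejects the OPTIMIZE statement.
val spark = SparkSession.builder()
  .config("spark.sql.extensions", "org.apache.kyuubi.sql.KyuubiSparkSQLExtension")
  .enableHiveSupport()
  .getOrCreate()
```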
+
+ORDER BY statement:
+
+```sql
+INSERT OVERWRITE TABLE conn_order_only_ip SELECT src_ip, src_port, dst_ip, dst_port FROM conn_random_parquet ORDER BY src_ip, dst_ip;
+
+INSERT OVERWRITE TABLE conn_order SELECT src_ip, src_port, dst_ip, dst_port FROM conn_random_parquet ORDER BY src_ip, src_port, dst_ip, dst_port;
+```
+
+Query statement:
+
+```sql
+set spark.sql.hive.convertMetastoreParquet=true;
+
+select count(*) from conn_random_parquet where src_ip like '157%' and dst_ip like '216.%';
+
+select count(*) from conn_zorder_only_ip where src_ip like '157%' and dst_ip like '216.%';
+
+select count(*) from conn_zorder where src_ip like '157%' and dst_ip like '216.%';
+```
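The "Average query time" figures in the tables below can be measured from the same spark-shell session with `spark.time`, which prints the wall-clock duration of an action (a sketch; the original doc does not state how the timings were collected):

```scala
// Prints "Time taken: ... ms" after running the count.
spark.time {
  spark.sql("select count(*) from conn_zorder where src_ip like '157%' and dst_ip like '216.%'").show()
}
```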
+
+### Z-order vs Order By Sort
+
+Compare the efficiency of Z-order optimize against a plain ORDER BY sort.
+
+**10 billion rows, 1,000 files; query resources: 200 cores, 600 GB memory**
+
+Z-order vs. ORDER BY on the IP columns only:
+
+| Table               | Row count      | Optimize time |
+|---------------------|----------------|---------------|
+| conn_order_only_ip  | 10,000,000,000 | 1591.99 s     |
+| conn_zorder_only_ip | 10,000,000,000 | 8371.405 s    |
+
+Z-order vs. ORDER BY on all columns:
+
+| Table       | Row count      | Optimize time |
+|-------------|----------------|---------------|
+| conn_order  | 10,000,000,000 | 1515.298 s    |
+| conn_zorder | 10,000,000,000 | 11057.194 s   |
+
+### Z-order vs Random order
+
+Compare the efficiency of querying Z-order optimized data against randomly ordered data.
+
+**10 billion rows, 200 files; query resources: 200 cores, 600 GB memory**
+
+| Table               | Average file size | Scan row count | Average query time | Row skipping ratio |
+|---------------------|-------------------|----------------|--------------------|--------------------|
+| conn_random_parquet | 1.2 G             | 10,000,000,000 | 27.554 s           | 0.0%               |
+| conn_zorder_only_ip | 890 M             | 43,170,600     | 2.459 s            | 99.568%            |
+| conn_zorder         | 890 M             | 54,841,302     | 3.185 s            | 99.451%            |
+
+**10 billion rows, 1,000 files; query resources: 200 cores, 600 GB memory**
+
+| Table               | Average file size | Scan row count | Average query time | Row skipping ratio |
+|---------------------|-------------------|----------------|--------------------|--------------------|
+| conn_random_parquet | 234.8 M           | 10,000,000,000 | 27.031 s           | 0.0%               |
+| conn_zorder_only_ip | 173.9 M           | 53,499,068     | 3.120 s            | 99.465%            |
+| conn_zorder         | 174.0 M           | 35,910,500     | 3.103 s            | 99.640%            |
+
+**1 billion rows, 10,000 files; query resources: 10 cores, 40 GB memory**
+
+| Table               | Average file size | Scan row count | Average query time | Row skipping ratio |
+|---------------------|-------------------|----------------|--------------------|--------------------|
+| conn_random_parquet | 2.7 M             | 1,000,000,000  | 76.772 s           | 0.0%               |
+| conn_zorder_only_ip | 2.1 M             | 406,572        | 3.963 s            | 99.959%            |
+| conn_zorder         | 2.2 M             | 387,942        | 3.621 s            | 99.961%            |
+