yaooqinn commented on code in PR #3211:
URL: https://github.com/apache/incubator-kyuubi/pull/3211#discussion_r943104927
##########
docs/connector/spark/tpch.rst:
##########
@@ -16,19 +16,80 @@
TPC-H
=====
-TPC-DS Integration
+The TPC-H is a decision support benchmark. It consists of a suite of business oriented ad-hoc queries and concurrent
+data modifications. The queries and the data populating the database have been chosen to have broad industry-wide
+relevance.
+
+.. tip::
+ This article assumes that you have mastered the basic knowledge and operation of `TPC-H`_.
+ For the knowledge about TPC-H not mentioned in this article, you can obtain it from its `Official Documentation`_.
+
+This connector can be used to test the capabilities and query syntax of Spark without configuring access to an external
+data source. When you query a TPC-H table, the connector generates the data on the fly using a deterministic algorithm.
+
+Go to `Try Kyuubi`_ to explore TPC-H data instantly!
+
+TPC-H Integration
------------------
+To enable the integration of the Kyuubi Spark SQL engine and TPC-H through
+Apache Spark DataSource V2 and Catalog APIs, you need to:
+
+- Reference the TPC-H connector :ref:`dependencies<spark-tpch-deps>`
+- Set the Spark catalog :ref:`configurations<spark-tpch-conf>`
+
.. _spark-tpch-deps:
Dependencies
************
+The **classpath** of the Kyuubi Spark SQL engine with TPC-H supported consists of
+
+1. kyuubi-spark-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with Kyuubi distributions
+2. a copy of Spark distribution
+3. kyuubi-spark-connector-tpch-\ |release|\ _2.12.jar, which can be found in the `Maven Central`_
+
+In order to make the TPC-H connector package visible for the runtime classpath of engines, we can use one of these methods:
+
+1. Put the TPC-H connector package into ``$SPARK_HOME/jars`` directly
+2. Set spark.jars=kyuubi-spark-connector-tpch-\ |release|\ _2.12.jar
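As a side note for the doc, the second method might be illustrated with a hedged sketch like the following, e.g. in ``$KYUUBI_HOME/conf/kyuubi-defaults.conf`` (the jar path below is illustrative, not a fixed location):

```properties
# illustrative path; point this at the actual downloaded connector jar
spark.jars=/path/to/kyuubi-spark-connector-tpch_2.12.jar
```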
+
.. _spark-tpch-conf:
Configurations
**************
+To add TPC-H tables as a catalog, we can set the following configurations:
+
+.. code-block:: properties
+
+ spark.sql.catalog.tpch=org.apache.kyuubi.spark.connector.tpch.TPCHCatalog
+ spark.sql.catalog.tpch.excludeDatabases=sf10000,sf30000 # optional Exclude database list from the catalog
+ spark.sql.catalog.tpch.useAnsiStringType=false # optional When true, use CHAR VARCHAR; otherwise use STRING
+ spark.sql.catalog.tpch.read.maxPartitionBytes=134217728 # optional Max data split size in bytes per task, consider to reduce it if you want a higher parallelism.
Review Comment:
```suggestion
   # (required) Register a catalog named `tpch` for the spark engine.
   spark.sql.catalog.tpch=org.apache.kyuubi.spark.connector.tpch.TPCHCatalog
   # (optional) Exclude database list from the catalog
   spark.sql.catalog.tpch.excludeDatabases=sf10000,sf30000
   # (optional) When true, use CHAR/VARCHAR; otherwise use STRING
   spark.sql.catalog.tpch.useAnsiStringType=false
   # (optional) Maximum bytes per task; consider reducing it if you want higher parallelism.
   spark.sql.catalog.tpch.read.maxPartitionBytes=134217728
```
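As a side note, the doc might also benefit from a quick sanity-check snippet once the catalog is registered. A hedged sketch (assuming the connector exposes the usual TPC-H scale-factor databases such as `sf1`, with table names per the TPC-H schema):

```sql
-- list the scale-factor databases exposed by the connector
SHOW DATABASES IN tpch;
-- query a TPC-H table; rows are generated on the fly deterministically
SELECT count(*) FROM tpch.sf1.customer;
```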
Questions:
1. Is the catalog name `tpch` hard-coded, or configurable?
2. If configurable, are we able to register multiple TPC-H catalogs, such as tpch1, tpch2 ...?
3. What happens if `excludeDatabases` contains invalid databases? If it is not an error, what are all the valid candidates?
4. What are the differences in user experience between CHAR/VARCHAR and STRING types?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]