yaooqinn commented on code in PR #3211:
URL: https://github.com/apache/incubator-kyuubi/pull/3211#discussion_r943104927
##########
docs/connector/spark/tpch.rst:
##########
@@ -16,19 +16,80 @@
TPC-H
=====
-TPC-DS Integration
+The TPC-H is a decision support benchmark. It consists of a suite of business oriented ad-hoc queries and concurrent
+data modifications. The queries and the data populating the database have been chosen to have broad industry-wide
+relevance.
+
+.. tip::
+ This article assumes that you have mastered the basic knowledge and operation of `TPC-H`_.
+ For the knowledge about TPC-H not mentioned in this article, you can obtain it from its `Official Documentation`_.
+
+This connector can be used to test the capabilities and query syntax of Spark without configuring access to an external
+data source. When you query a TPC-H table, the connector generates the data on the fly using a deterministic algorithm.
+
+Go to `Try Kyuubi`_ to explore TPC-H data instantly!
+
+TPC-H Integration
------------------
+To enable the integration of the Kyuubi Spark SQL engine and TPC-H through
+Apache Spark DataSource V2 and Catalog APIs, you need to:
+
+- Reference the TPC-H connector :ref:`dependencies<spark-tpch-deps>`
+- Set the Spark catalog :ref:`configurations<spark-tpch-conf>`
+
.. _spark-tpch-deps:
Dependencies
************
+The **classpath** of the Kyuubi Spark SQL engine with TPC-H supported consists of
+
+1. kyuubi-spark-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with Kyuubi distributions
+2. a copy of Spark distribution
+3. kyuubi-spark-connector-tpch-\ |release|\ _2.12.jar, which can be found in the `Maven Central`_
+
+In order to make the TPC-H connector package visible for the runtime classpath of engines, we can use one of these methods:
+
+1. Put the TPC-H connector package into ``$SPARK_HOME/jars`` directly
+2. Set spark.jars=kyuubi-spark-connector-tpch-\ |release|\ _2.12.jar
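As a side note for the doc, the second method might be illustrated with a hedged sketch like the following, e.g. in ``$KYUUBI_HOME/conf/kyuubi-defaults.conf`` (the jar path below is illustrative, not a fixed location):

```properties
# illustrative path; point this at the actual downloaded connector jar
spark.jars=/path/to/kyuubi-spark-connector-tpch_2.12.jar
```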
+
.. _spark-tpch-conf:
Configurations
**************
+To add TPC-H tables as a catalog, we can set the following configurations:
+
+.. code-block:: properties
+
+ spark.sql.catalog.tpch=org.apache.kyuubi.spark.connector.tpch.TPCHCatalog
+ spark.sql.catalog.tpch.excludeDatabases=sf10000,sf30000 # optional Exclude database list from the catalog
+ spark.sql.catalog.tpch.useAnsiStringType=false # optional When true, use CHAR VARCHAR; otherwise use STRING
+ spark.sql.catalog.tpch.read.maxPartitionBytes=134217728 # optional Max data split size in bytes per task, consider to reduce it if you want a higher parallelism.
Review Comment:
```suggestion
   # (required) Register a catalog named `tpch` for the spark engine.
   spark.sql.catalog.tpch=org.apache.kyuubi.spark.connector.tpch.TPCHCatalog
   # (optional) Exclude database list from the catalog
   spark.sql.catalog.tpch.excludeDatabases=sf10000,sf30000
   # (optional) When true, use CHAR/VARCHAR; otherwise use STRING
   spark.sql.catalog.tpch.useAnsiStringType=false
   # (optional) Maximum bytes per task; consider reducing it if you want higher parallelism.
   spark.sql.catalog.tpch.read.maxPartitionBytes=134217728
```
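As a side note, the doc might also benefit from a quick sanity-check snippet once the catalog is registered. A hedged sketch (assuming the connector exposes the usual TPC-H scale-factor databases such as `sf1`, with table names per the TPC-H schema):

```sql
-- list the scale-factor databases exposed by the connector
SHOW DATABASES IN tpch;
-- query a TPC-H table; rows are generated on the fly deterministically
SELECT count(*) FROM tpch.sf1.customer;
```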
Questions:
1. Is the catalog name `tpch` hard-coded, or configurable?
2. If configurable, are we able to register multiple TPC-H catalogs, such as tpch1, tpch2 ...?
3. What happens if `excludeDatabases` contains invalid databases? If it is not an error, what are all the valid candidates?
4. What are the differences in user experience between CHAR/VARCHAR and STRING types?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]