[GitHub] [spark] cloud-fan commented on a change in pull request #29939: [SPARK-33062][SQL] Make DataFrameReader.jdbc work for DataSource V2

GitBox Fri, 09 Oct 2020 06:28:36 -0700


cloud-fan commented on a change in pull request #29939:
URL: https://github.com/apache/spark/pull/29939#discussion_r501441101




##########
File path: sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala
##########
@@ -221,4 +221,21 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession {
       checkAnswer(sql("SELECT name, id FROM h2.test.abc"), Row("bob", 4))
     }
   }
+
+  test("DataFrameReader: jdbc") {
+    withTable("h2.test.abc") {
+      sql("CREATE TABLE h2.test.abc USING _ AS SELECT * FROM h2.test.people")
+      val properties = new Properties()
+      val df1 = spark.read.jdbc(url, "h2.test.abc", properties)

Review comment:
       I'm a bit confused about this. There are 3 ways to use the JDBC data source:
   1. Use the `DataFrameReader/Writer` API to access JDBC tables/queries directly.
   2. Register as a table, and access the table.
   3. Register as a catalog, and access tables inside the catalog.
   
   `spark.read.jdbc(url, "h2.test.abc", properties)` seems like a mix of 1 and 3. What's the use case you are targeting?
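
For reference, the three access paths listed in this comment could look roughly like the sketch below. It is not taken from the PR: the PostgreSQL URL, the catalog name `pg`, the table `public.orders` and the view name `orders_jdbc` are hypothetical, and a temporary view stands in for "register as a table". The catalog class name is the JDBC V2 catalog as I understand it in Spark 3.1+.

```scala
import java.util.Properties
import org.apache.spark.sql.{DataFrame, SparkSession}

object JdbcAccessPaths {
  // Hypothetical connection details, for illustration only.
  // Credentials would normally go into the Properties or options.
  val url = "jdbc:postgresql://dbhost:5432/shop"

  // Path 3 needs the catalog to be configured on the session.
  def session(): SparkSession = SparkSession.builder()
    .master("local[1]")
    .config("spark.sql.catalog.pg",
      "org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog")
    .config("spark.sql.catalog.pg.url", url)
    .config("spark.sql.catalog.pg.driver", "org.postgresql.Driver")
    .getOrCreate()

  // 1. Direct access: the url and the remote table name are passed on each call.
  def direct(spark: SparkSession): DataFrame =
    spark.read.jdbc(url, "public.orders", new Properties())

  // 2. Register once under a Spark-side name, then refer to that name.
  def viaRegisteredTable(spark: SparkSession): DataFrame = {
    spark.sql(
      s"""CREATE OR REPLACE TEMPORARY VIEW orders_jdbc
         |USING jdbc
         |OPTIONS (url '$url', dbtable 'public.orders')""".stripMargin)
    spark.table("orders_jdbc")
  }

  // 3. Catalog access: the url lives in the catalog config, so only a
  //    multi-part Spark identifier (catalog.namespace.table) is needed.
  def viaCatalog(spark: SparkSession): DataFrame =
    spark.table("pg.public.orders")
}
```

In paths 1 and 2 the url travels with the call or the table definition; in path 3 it comes from the catalog configuration, which is why passing both a url and a catalog-qualified name looks like a mix of the two.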
   

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala
##########
@@ -221,4 +221,21 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession {
       checkAnswer(sql("SELECT name, id FROM h2.test.abc"), Row("bob", 4))
     }
   }
+
+  test("DataFrameReader: jdbc") {
+    withTable("h2.test.abc") {
+      sql("CREATE TABLE h2.test.abc USING _ AS SELECT * FROM h2.test.people")
+      val properties = new Properties()
+      val df1 = spark.read.jdbc(url, "h2.test.abc", properties)

Review comment:
       We need to distinguish between APIs and shortcuts. `DataFrameWriter` has 3 APIs: `save`, `insertInto` and `saveAsTable`. `parquet`, `json`, `jdbc`, etc. are shortcuts that eventually call `save()`.
   
   `insertInto` and `saveAsTable` take a Spark table name and should support multiple catalogs. `save` interacts with the data source directly through options, and thus shouldn't support multi-catalog.
   
   This particular test looks confusing: the registered JDBC catalog should already have the url config, so why do we need to specify it again in `spark.read.jdbc`?
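
A minimal sketch of the API/shortcut distinction on the writer side, under assumptions that are not in the PR: the table names `public.orders_copy` and `public.orders_new` are hypothetical, and a JDBC catalog is assumed to be registered under the hypothetical name `pg`.

```scala
import java.util.Properties
import org.apache.spark.sql.{DataFrame, SaveMode}

object WriterApisVsShortcuts {
  // Shortcut: `jdbc(...)` fills in the format and options, then ends in save().
  def shortcut(df: DataFrame, url: String): Unit =
    df.write.mode(SaveMode.Append)
      .jdbc(url, "public.orders_copy", new Properties())

  // The same write spelled out through the `save` API. The target is
  // described purely by data source options, not by a Spark table name.
  def explicitSave(df: DataFrame, url: String): Unit =
    df.write.format("jdbc")
      .option("url", url)
      .option("dbtable", "public.orders_copy")
      .mode(SaveMode.Append)
      .save()

  // Table APIs: these take Spark table names, so a catalog prefix is
  // meaningful here. `insertInto` expects the target table to exist already.
  def tableApis(df: DataFrame): Unit = {
    df.write.insertInto("pg.public.orders_copy")
    df.write.saveAsTable("pg.public.orders_new")
  }
}
```

Only `tableApis` resolves its argument through Spark's catalogs; the first two methods hand the name to the data source as an option.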

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala
##########
@@ -221,4 +221,21 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession {
       checkAnswer(sql("SELECT name, id FROM h2.test.abc"), Row("bob", 4))
     }
   }
+
+  test("DataFrameReader: jdbc") {
+    withTable("h2.test.abc") {
+      sql("CREATE TABLE h2.test.abc USING _ AS SELECT * FROM h2.test.people")
+      val properties = new Properties()
+      val df1 = spark.read.jdbc(url, "h2.test.abc", properties)

Review comment:
       In the doc of `spark.read.jdbc`: `@param table Name of the table in the external database.`
   
   This is not a Spark table name, but a table name in the remote JDBC server such as MySQL.
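
One way to see that the `table` argument belongs to the remote database: the JDBC source documents that anything valid in a SQL `FROM` clause can be used there, including a parenthesized subquery that the remote server evaluates. The sketch below assumes a hypothetical remote table `public.orders` with `id` and `total` columns.

```scala
import java.util.Properties
import org.apache.spark.sql.{DataFrame, SparkSession}

object RemoteTableName {
  // The second argument is sent to the external database as part of the
  // query it runs, so it follows that database's naming and SQL rules,
  // not Spark's catalog resolution.
  def bigOrders(spark: SparkSession, url: String): DataFrame =
    spark.read.jdbc(
      url,
      "(SELECT id, total FROM public.orders WHERE total > 100) AS big_orders",
      new Properties())
}
```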



