juliuszsompolski commented on pull request #30919:
URL: https://github.com/apache/spark/pull/30919#issuecomment-752407993
We might do some future work with Simba on this.
To support three-part catalog.database.table identifiers, the Simba JDBC/ODBC
drivers currently accept a catalog named "SPARK" and drop it when translating
queries to Spark, even with UseNativeQuery=1:
```
scala> val conn = java.sql.DriverManager.getConnection("jdbc:spark://<...>.databricks.com:443/default;transportMode=http;ssl=1;httpPath=...;AuthMech=3;UID=...;PWD=...;UseNativeQuery=1")
conn: java.sql.Connection = com.simba.spark.hivecommon.jdbc42.Hive42Connection@2484dbb7
scala> val stmt = conn.createStatement()
scala> stmt.execute("CREATE TABLE SPARK.default.catalogtest(foo int)")
res0: Boolean = false
scala> stmt.executeQuery("SELECT * FROM SPARK.default.catalogtest")
res2: java.sql.ResultSet = com.simba.spark.jdbc.jdbc42.S42ForwardResultSet@298e002d
```
The actual queries sent to the Thriftserver are `CREATE TABLE
default.catalogtest(foo int)` and `SELECT * FROM default.catalogtest`; Simba
simply drops the catalog name "SPARK" from the queries.
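A rough sketch of what that client-side rewrite might look like. This is purely illustrative: the class, method, and regex below are my assumptions, not the Simba driver's actual implementation.

```java
import java.util.regex.Pattern;

public class CatalogRewrite {
    // Illustrative only: strip a leading "SPARK." catalog qualifier from
    // three-part identifiers before the query is sent to the Thriftserver.
    // The lookahead requires two more dotted name parts, so two-part
    // identifiers and names like "spark_catalog" are left untouched.
    private static final Pattern SPARK_CATALOG =
        Pattern.compile("\\bSPARK\\.(?=\\w+\\.\\w+)", Pattern.CASE_INSENSITIVE);

    public static String stripSparkCatalog(String sql) {
        return SPARK_CATALOG.matcher(sql).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripSparkCatalog(
            "SELECT * FROM SPARK.default.catalogtest"));
        // prints: SELECT * FROM default.catalogtest
    }
}
```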
Simba will also return a canned response with a single catalog "Spark" to a
metadata getCatalogs call.
```
scala> conn.getMetaData.getCatalogs
res3: java.sql.ResultSet = com.simba.spark.jdbc.jdbc42.S42MetaDataProxy@51d9fd30
scala> res3.next()
res4: Boolean = true
scala> res3.getObject(1)
res5: Object = Spark
scala> res3.next()
res8: Boolean = false
```
Thriftserver's SparkGetCatalogsOperation just returns an empty result set; the
Simba drivers ignore it.
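In other words, the driver appears to substitute a hard-coded single-catalog list when the server's GetCatalogs response is empty. A toy model of that fallback, with names and behavior assumed by me rather than taken from the driver:

```java
import java.util.List;

public class CannedCatalogs {
    // Illustrative fallback: if the server's GetCatalogs response is empty,
    // return a single placeholder catalog, mirroring what the Simba driver
    // appears to do with "Spark". Otherwise, pass the server's list through.
    public static List<String> catalogsOrDefault(List<String> fromServer) {
        return fromServer.isEmpty() ? List.of("Spark") : fromServer;
    }
}
```

If SparkGetCatalogsOperation ever started returning real catalogs, a driver built this way would automatically prefer them over the placeholder.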
However, the following already seems to work correctly:
```
scala> stmt.execute("CREATE TABLE spark_catalog.default.catalogtest2(foo int)")
res11: Boolean = false
scala> stmt.executeQuery("SELECT * FROM SPARK.default.catalogtest2")
res12: java.sql.ResultSet = com.simba.spark.jdbc.jdbc42.S42ForwardResultSet@59845a40
scala> stmt.executeQuery("SELECT * FROM spark_catalog.default.catalogtest2")
res13: java.sql.ResultSet = com.simba.spark.jdbc.jdbc42.S42ForwardResultSet@41433530
scala> stmt.execute("CREATE TABLE spark_catalog2.default.catalogtest2(foo int)")
java.sql.SQLException: [Simba][SparkJDBCDriver](500051) ERROR processing query/statement. Error Code: 0, SQL state: Error running query: org.apache.spark.sql.AnalysisException: The namespace in session catalog must have exactly one name part: spark_catalog2.default.catalogtest2;, Query: CREATE TABLE spark_catalog2.default.catalogtest2(foo int).
  at com.simba.spark.hivecommon.api.HS2Client.pollForOperationCompletion(Unknown Source)
  at com.simba.spark.hivecommon.api.HS2Client.executeStatementInternal(Unknown Source)
  at com.simba.spark.hivecommon.api.HS2Client.executeStatement(Unknown Source)
  at com.simba.spark.hivecommon.dataengine.HiveJDBCNativeQueryExecutor.executeHelper(Unknown Source)
  at com.simba.spark.hivecommon.dataengine.HiveJDBCNativeQueryExecutor.execute(Unknown Source)
  at com.simba.spark.jdbc.common.SStatement.executeNoParams(Unknown Source)
  at com.simba.spark.jdbc.common.SStatement.execute(Unknown Source)
  ... 31 elided
Caused by: com.simba.spark.support.exceptions.GeneralException: [Simba][SparkJDBCDriver](500051) ERROR processing query/statement. Error Code: 0, SQL state: Error running query: org.apache.spark.sql.AnalysisException: The namespace in session catalog must have exactly one name part: spark_catalog2.default.catalogtest2;, Query: CREATE TABLE spark_catalog2.default.catalogtest2(foo int).
  ... 38 more
```
For catalog names other than "SPARK", the Simba drivers forward the queries
verbatim.
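Putting the observations above together, the end-to-end behavior for a three-part identifier can be modeled roughly as follows. This is a simplified model of the observed behavior only, not Spark's or Simba's actual resolution code:

```java
public class CatalogResolution {
    // Simplified model: the driver drops the "SPARK" placeholder, forwards
    // everything else verbatim, and Spark's session catalog then accepts
    // only "spark_catalog" as an explicit catalog name.
    public static String resolve(String catalog, String db, String table) {
        if (catalog.equalsIgnoreCase("SPARK")) {
            // Driver strips the placeholder before sending the query.
            return db + "." + table;
        }
        if (catalog.equals("spark_catalog")) {
            // Forwarded verbatim and accepted by Spark's session catalog.
            return catalog + "." + db + "." + table;
        }
        // Forwarded verbatim and rejected by Spark, as in the trace above.
        throw new IllegalArgumentException(
            "The namespace in session catalog must have exactly one name part: "
            + catalog + "." + db + "." + table);
    }
}
```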
What are Spark's plans for supporting multiple catalogs? Should we start
returning them via SparkGetCatalogsOperation, and should Simba start respecting
the catalogs returned there and drop its own "SPARK" catalog placeholder? I
think some existing downstream connectors (Alation, I believe) depend on
"SPARK" as the catalog name, so Simba might need to keep "SPARK" as a special
default catalog.
cc @wangyum @bogdanghit