linyanghao opened a new issue, #5692:
URL: https://github.com/apache/iceberg/issues/5692
### Apache Iceberg version
0.14.0 (latest release)
### Query engine
Spark
### Please describe the bug 🐞
If Hive support is not enabled for the SparkSession, Spark falls back to an in-memory
catalog as its built-in catalog. Iceberg's SparkSessionCatalog then delegates
non-Iceberg requests to that in-memory catalog, which is confusing because the user
expects those requests to be handled by the Hive catalog.
For example, consider a Hive metastore at thrift://localhost:9083 with the following
namespaces:
```
+---------+
|namespace|
+---------+
| db|
| default|
+---------+
```
and a table named "table" in namespace "db":
```
+---------+---------+-----------+
|namespace|tableName|isTemporary|
+---------+---------+-----------+
| db| table| false|
+---------+---------+-----------+
```
The following steps will produce inconsistent results:
1. Build a SparkSession without Hive support:
```
SparkSession spark = SparkSession
.builder()
.appName("SparkHiveTest")
.master("local[*]")
.config("spark.sql.extensions",
"org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
.config("spark.sql.catalog.spark_catalog",
"org.apache.iceberg.spark.SparkSessionCatalog")
.config("spark.sql.catalog.spark_catalog.type", "hive")
.config("spark.sql.catalog.spark_catalog.uri",
"thrift://localhost:9083")
.config("spark.sql.catalog.spark_catalog.warehouse",
"file:///tmp/spark-warehouse")
.getOrCreate();
```
2. List namespaces. The request is handled by Spark's in-memory catalog, so
namespace "db" is not shown:
```
spark.sql("SHOW DATABASES").show();
+---------+
|namespace|
+---------+
| default|
+---------+
```
3. Create namespace "db". The request is handled by Spark's in-memory catalog, so
it succeeds:
```
spark.sql("CREATE DATABASE db");
```
4. Create table "db.table". This request is handled by Iceberg's Hive catalog,
where "db.table" already exists, so it fails:
```
spark.sql("CREATE TABLE db.table (id bigint, data string) USING ICEBERG");
// fails: Table db.table already exists
```
However, the user expects step 4 to succeed, since "db" is a new, empty namespace
from their perspective.
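As a workaround, the inconsistency goes away if Hive support is enabled explicitly when building the session. A sketch of the same builder with the extra call (assumes the metastore configuration from above):

```java
// Same builder as above, with Hive support enabled so that Spark's
// built-in catalog (and therefore SparkSessionCatalog's delegate)
// talks to the Hive metastore instead of the in-memory catalog.
SparkSession spark = SparkSession
    .builder()
    .appName("SparkHiveTest")
    .master("local[*]")
    .enableHiveSupport() // makes spark_catalog delegate to Hive
    .config("spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.spark_catalog",
        "org.apache.iceberg.spark.SparkSessionCatalog")
    .config("spark.sql.catalog.spark_catalog.type", "hive")
    .config("spark.sql.catalog.spark_catalog.uri",
        "thrift://localhost:9083")
    .config("spark.sql.catalog.spark_catalog.warehouse",
        "file:///tmp/spark-warehouse")
    .getOrCreate();
```

With this change, steps 2-4 are all handled by the Hive-backed catalog and behave consistently. But relying on the user to remember this call is exactly the problem; hence the proposed check below.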
To avoid such problems, SparkSessionCatalog should verify during initialization that
Hive support is enabled for Spark, and fail fast with a clear error message if it
is not.
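A rough sketch of what such a check could look like. The class and method names here are hypothetical, not from the Iceberg codebase; the real implementation would presumably read Spark's `spark.sql.catalogImplementation` setting (which is `"hive"` when Hive support is enabled and `"in-memory"` otherwise) from the active session's conf inside `SparkSessionCatalog`:

```java
// Hypothetical sketch of the proposed initialization check.
public class CatalogCheck {

  // Returns true when Spark's built-in catalog is backed by the Hive metastore,
  // i.e. spark.sql.catalogImplementation == "hive".
  public static boolean isHiveCatalog(String catalogImplementation) {
    return "hive".equalsIgnoreCase(catalogImplementation);
  }

  // Fails fast instead of letting SparkSessionCatalog silently delegate
  // namespace/table requests to the in-memory catalog.
  public static void ensureHiveCatalog(String catalogImplementation) {
    if (!isHiveCatalog(catalogImplementation)) {
      throw new IllegalStateException(
          "SparkSessionCatalog is configured with type=hive, but Spark's "
              + "built-in catalog is '" + catalogImplementation + "'. "
              + "Enable Hive support (enableHiveSupport()) on the SparkSession.");
    }
  }

  public static void main(String[] args) {
    ensureHiveCatalog("hive"); // passes silently
    try {
      ensureHiveCatalog("in-memory");
    } catch (IllegalStateException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

With a check like this in `SparkSessionCatalog` initialization, the misconfiguration in step 1 would surface immediately at session construction instead of as a confusing "table already exists" error in step 4.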
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]