Thijsvandepoll opened a new issue, #7574: URL: https://github.com/apache/iceberg/issues/7574
### Apache Iceberg version

1.2.1 (latest release)

### Query engine

Spark

### Please describe the bug 🐞

I have been experiencing some (unexpected?) behavior where a catalog reference in Spark is not reflected in the Hive Metastore. I followed the Spark configuration from the [documentation](https://iceberg.apache.org/docs/latest/spark-configuration/), which looks like it should create a new catalog with the given name. Everything works as expected, except that the catalog is NOT inserted into the Hive Metastore. This has some implications, which I will showcase with an example. Here is a sample script in PySpark:

```python
import os

from pyspark.sql import SparkSession

deps = [
    "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.2.1",
    "org.apache.iceberg:iceberg-aws:1.2.1",
    "software.amazon.awssdk:bundle:2.17.257",
    "software.amazon.awssdk:url-connection-client:2.17.257",
]
os.environ["PYSPARK_SUBMIT_ARGS"] = f"--packages {','.join(deps)} pyspark-shell"
os.environ["AWS_ACCESS_KEY_ID"] = "minioadmin"
os.environ["AWS_SECRET_ACCESS_KEY"] = "minioadmin"
os.environ["AWS_REGION"] = "eu-east-1"

catalog = "hive_catalog"
spark = SparkSession.\
    builder.\
    appName("Iceberg Reader").\
    config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions").\
    config(f"spark.sql.catalog.{catalog}", "org.apache.iceberg.spark.SparkCatalog").\
    config(f"spark.sql.catalog.{catalog}.type", "hive").\
    config(f"spark.sql.catalog.{catalog}.uri", "thrift://localhost:9083").\
    config(f"spark.sql.catalog.{catalog}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO").\
    config(f"spark.sql.catalog.{catalog}.s3.endpoint", "http://localhost:9000").\
    config(f"spark.sql.catalog.{catalog}.warehouse", "s3a://lakehouse").\
    config("hive.metastore.uris", "thrift://localhost:9083").\
    enableHiveSupport().\
    getOrCreate()

# Raises an error: wrong_catalog is not a configured catalog
spark.sql("CREATE NAMESPACE wrong_catalog.new_db;")

# Correct creation of the namespace
spark.sql(f"CREATE NAMESPACE {catalog}.new_db;")

# Create a table
spark.sql(f"CREATE TABLE {catalog}.new_db.new_table (col1 INT, col2 STRING);")

# Insert data
spark.sql(f"INSERT INTO {catalog}.new_db.new_table VALUES (1, 'first'), (2, 'second');")

# Read data
spark.sql(f"SELECT * FROM {catalog}.new_db.new_table;").show()
#+----+------+
#|col1|  col2|
#+----+------+
#|   1| first|
#|   2|second|
#+----+------+

# Read metadata (lower/upper bounds are binary and render as unprintable bytes)
spark.sql(f"SELECT * FROM {catalog}.new_db.new_table.files;").show()
#+-------+--------------------+-----------+-------+------------+------------------+------------------+----------------+-----------------+----------------+--------------------+--------------------+------------+-------------+------------+-------------+--------------------+
#|content|           file_path|file_format|spec_id|record_count|file_size_in_bytes|      column_sizes|    value_counts|null_value_counts|nan_value_counts|        lower_bounds|        upper_bounds|key_metadata|split_offsets|equality_ids|sort_order_id|    readable_metrics|
#+-------+--------------------+-----------+-------+------------+------------------+------------------+----------------+-----------------+----------------+--------------------+--------------------+------------+-------------+------------+-------------+--------------------+
#|      0|s3a://lakehouse/n...|    PARQUET|      0|           1|               652|{1 -> 47, 2 -> 51}|{1 -> 1, 2 -> 1}| {1 -> 0, 2 -> 0}|              {}|{1 -> ������, 2 -> ...|{1 -> ������, 2 -> ...|        null|          [4]|        null|            0|{{47, 1, 0, null,...|
#|      0|s3a://lakehouse/n...|    PARQUET|      0|           1|               660|{1 -> 47, 2 -> 53}|{1 -> 1, 2 -> 1}| {1 -> 0, 2 -> 0}|              {}|{1 -> ������, 2 -> ...|{1 -> ������, 2 -> ...|        null|          [4]|        null|            0|{{47, 1, 0, null,...|
#+-------+--------------------+-----------+-------+------------+------------------+------------------+----------------+-----------------+----------------+--------------------+--------------------+------------+-------------+------------+-------------+--------------------+
```
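For reference, the `CTLGS`, `DBS`, and `TBLS` outputs below were read straight from the metastore's backing database. A minimal sketch of how to do that, assuming a PostgreSQL-backed metastore with `psycopg2` installed (the connection details are placeholders, not part of the setup above):

```python
import psycopg2

# Hypothetical connection details for the metastore's backing database.
conn = psycopg2.connect(
    host="localhost", port=5432, dbname="metastore", user="hive", password="hive"
)
with conn.cursor() as cur:
    # On PostgreSQL the Hive Metastore schema uses quoted upper-case table names.
    for table in ("CTLGS", "DBS", "TBLS"):
        cur.execute(f'SELECT * FROM "{table}"')
        print(table, cur.fetchall())
conn.close()
```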
Now this all seems fine: it created a namespace and a table, and inserted data into the table. Querying the Hive Metastore, however, reveals the problem (`CTLGS`):

```
|CTLG_ID|NAME|DESC                    |LOCATION_URI    |
|-------|----|------------------------|----------------|
|1      |hive|Default catalog for Hive|s3a://lakehouse/|
```

It does NOT insert a new catalog with the configured name. We can see that the namespace and table have in fact been inserted into the Hive Metastore (`DBS` and `TBLS`):

```
|DB_ID|DESC                 |DB_LOCATION_URI          |NAME   |OWNER_NAME    |OWNER_TYPE|CTLG_NAME|
|-----|---------------------|-------------------------|-------|--------------|----------|---------|
|1    |Default Hive database|s3a://lakehouse/         |default|public        |ROLE      |hive     |
|2    |                     |s3a://lakehouse/new_db.db|new_db |thijsvandepoll|USER      |hive     |

|TBL_ID|CREATE_TIME|DB_ID|LAST_ACCESS_TIME|OWNER         |OWNER_TYPE|RETENTION |SD_ID|TBL_NAME |TBL_TYPE      |VIEW_EXPANDED_TEXT|VIEW_ORIGINAL_TEXT|IS_REWRITE_ENABLED|
|------|-----------|-----|----------------|--------------|----------|----------|-----|---------|--------------|------------------|------------------|------------------|
|1     |1683707647 |2    |80467           |thijsvandepoll|USER      |2147483647|1    |new_table|EXTERNAL_TABLE|                  |                  |0                 |
```

So everything lands in the default Hive catalog (`hive`) rather than in a catalog with the configured name. I am not exactly sure whether this is expected behavior. Everything works fine up to this point; the problem appears when we try to create the same namespace through a second catalog:

```python
import os

from pyspark.sql import SparkSession

deps = [
    "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.2.1",
    "org.apache.iceberg:iceberg-aws:1.2.1",
    "software.amazon.awssdk:bundle:2.17.257",
    "software.amazon.awssdk:url-connection-client:2.17.257",
]
os.environ["PYSPARK_SUBMIT_ARGS"] = f"--packages {','.join(deps)} pyspark-shell"
os.environ["AWS_ACCESS_KEY_ID"] = "minioadmin"
os.environ["AWS_SECRET_ACCESS_KEY"] = "minioadmin"
os.environ["AWS_REGION"] = "eu-east-1"

catalog = "other_catalog"
spark = SparkSession.\
    builder.\
    appName("Iceberg Reader").\
    config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions").\
    config(f"spark.sql.catalog.{catalog}", "org.apache.iceberg.spark.SparkCatalog").\
    config(f"spark.sql.catalog.{catalog}.type", "hive").\
    config(f"spark.sql.catalog.{catalog}.uri", "thrift://localhost:9083").\
    config(f"spark.sql.catalog.{catalog}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO").\
    config(f"spark.sql.catalog.{catalog}.s3.endpoint", "http://localhost:9000").\
    config(f"spark.sql.catalog.{catalog}.warehouse", "s3a://lakehouse").\
    config("hive.metastore.uris", "thrift://localhost:9083").\
    enableHiveSupport().\
    getOrCreate()

# Raises an error: the namespace created via hive_catalog already exists here
spark.sql(f"CREATE NAMESPACE {catalog}.new_db;")
# pyspark.sql.utils.AnalysisException: Namespace 'new_db' already exists

# Create another namespace
spark.sql(f"CREATE NAMESPACE {catalog}.other_db;")

# Access the data that was created via hive_catalog, through this catalog
spark.sql(f"SELECT * FROM {catalog}.new_db.new_table;").show()
#+----+------+
#|col1|  col2|
#+----+------+
#|   1| first|
#|   2|second|
#+----+------+
```
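A quick way to confirm the aliasing, reusing the `spark` session from the second script; the comments describe what I would expect given the metastore contents above:

```python
# Both Spark catalog names are backed by the same metastore, so everything
# created through hive_catalog is visible through other_catalog as well.
spark.sql("SHOW NAMESPACES IN other_catalog").show()
# Expected: default, new_db and other_db

spark.sql("SHOW TABLES IN other_catalog.new_db").show()
# Expected: new_table, although it was created through hive_catalog
```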
Now we can see that even though we are referencing another catalog, it still implicitly uses the default Hive catalog. This shows up in `DBS` in the Hive Metastore:

```
|DB_ID|DESC                 |DB_LOCATION_URI            |NAME    |OWNER_NAME    |OWNER_TYPE|CTLG_NAME|
|-----|---------------------|---------------------------|--------|--------------|----------|---------|
|1    |Default Hive database|s3a://lakehouse/           |default |public        |ROLE      |hive     |
|2    |                     |s3a://lakehouse/new_db.db  |new_db  |thijsvandepoll|USER      |hive     |
|3    |                     |s3a://lakehouse/other_db.db|other_db|thijsvandepoll|USER      |hive     |
```

Basically, this means that Iceberg together with the Hive Metastore has no notion of a catalog: there is just a single list of namespaces and tables, so it behaves as one catalog no matter how many Spark catalogs are configured (the sketch below shows the only kind of isolation I can imagine working).

Can anyone help me understand what is going on? Am I missing some configuration? Is this expected behavior or a bug? Thanks in advance!
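To make the question concrete, here is a hedged sketch of what I mean by isolation having to come from the Hive side rather than from the Spark catalog name; the metastore URIs and warehouse paths are made up, not part of my setup:

```python
from pyspark.sql import SparkSession

# Hypothetical: two Spark catalogs that are actually isolated, but only
# because they point at different metastores and warehouses, not because
# their names differ.
spark = (
    SparkSession.builder
    .appName("Isolated catalogs sketch")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.dev", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.dev.type", "hive")
    .config("spark.sql.catalog.dev.uri", "thrift://dev-metastore:9083")
    .config("spark.sql.catalog.dev.warehouse", "s3a://dev-lakehouse")
    .config("spark.sql.catalog.prod", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.prod.type", "hive")
    .config("spark.sql.catalog.prod.uri", "thrift://prod-metastore:9083")
    .config("spark.sql.catalog.prod.warehouse", "s3a://prod-lakehouse")
    .getOrCreate()
)
```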
