linliu-code opened a new issue, #14081:
URL: https://github.com/apache/hudi/issues/14081

   ### Bug Description
   
   **What happened:**
   When a catalog is enabled (either an external one or the local Spark catalog) and a 
table schema has been registered in that catalog, either via Spark SQL queries or through 
MetaSync, a later Spark DataSource write that creates a new table with the same name can 
misbehave: if Hudi cannot find the table schema in storage, it falls back to the catalog 
for the schema. This can cause issues such as:
   1. User A creates a table named `table_name` in the catalog, either using 
`Spark SQL` or `spark.sql()`;
   2. User B enables the catalog and uses the Spark datasource to insert/bulk_insert 
a different table with the same name `table_name` but a different schema; User B's 
queries then fail because the table in the catalog and the table created by B are 
not compatible under schema evolution.
   
   **What you expected:**
   User B's queries should not consult the catalog for a table that was created in a 
different session.
   
   **Steps to reproduce:**
   ```
   ~/spark/spark-3.3.4-bin-hadoop3/bin/spark-shell --packages org.apache.hudi:hudi-spark$SPARK_VERSION-bundle_2.12:0.15.0 \
   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
   --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
   --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
   --conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'
   
   import java.util.UUID
   val tableName = UUID.randomUUID().toString.replaceAll("-","_")
   val basePath = "file:///tmp/trips_table" + UUID.randomUUID().toString
   
   spark.sql(s"""CREATE TABLE ${tableName} (
       ts BIGINT,
       uuid STRING,
       rider STRING,
       driver STRING,
       fare DOUBLE,
       city STRING
   ) USING HUDI
   PARTITIONED BY (city);""")
   
   val columns = Seq("ts1","uuid","rider","driver","fare","city")
   val data = Seq(
     (1695159649087L, "334e26e9-8355-45cc-97c6-c31daf0df330", "rider-A", "driver-K", 19.10, "san_francisco"),
     (1695115999911L, "c8abbe79-8d89-47ea-b4ce-4d224bae5bfa", "rider-J", "driver-T", 17.85, "chennai"))
   val inserts = spark.createDataFrame(data).toDF(columns:_*)
   inserts.write.format("hudi").
     option("hoodie.datasource.write.partitionpath.field", "city").
     option("hoodie.table.name", tableName).
     mode("overwrite").
     save(basePath)
   ```
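   Note that the incompatibility in the repro comes down to a column-name mismatch 
between the schema registered in the catalog (`ts BIGINT`) and the DataFrame written 
by User B (`ts1`). The following is a minimal, self-contained sketch of that 
comparison for illustration only; the object and method names are hypothetical and 
this is not Hudi's actual schema-evolution logic:
   
   ```scala
   // Illustrative only: the field sets mirror the repro above.
   // `SchemaMismatchDemo` is a hypothetical name, not a Hudi class.
   object SchemaMismatchDemo {
     val catalogFields  = Set("ts", "uuid", "rider", "driver", "fare", "city")
     val incomingFields = Set("ts1", "uuid", "rider", "driver", "fare", "city")
   
     // Columns present on only one side make the write incompatible:
     // the DataFrame adds "ts1" and is missing the required "ts".
     def incompatibleColumns: Set[String] =
       (incomingFields -- catalogFields) ++ (catalogFields -- incomingFields)
   
     def main(args: Array[String]): Unit =
       println(s"incompatible columns: $incompatibleColumns")
   }
   ```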
   
   ### Environment
   
   **Hudi version:**
   **Query engine:** (Spark/Flink/Trino etc)
   **Relevant configs:**
   
   
   ### Logs and Stack Trace
   
   _No response_

