suryaprasanna opened a new pull request, #18203:
URL: https://github.com/apache/hudi/pull/18203

   ### Describe the issue this Pull Request addresses
   
   Hive sync in Spark-based environments depends on Hive metastore/Thrift classes that are not always available or compatible at runtime. This can make sync fail or behave unreliably even when the table lifecycle is managed entirely through Spark SQL catalog APIs.
   
   This change enables Hive sync to use a Spark-catalog-backed 
`IMetaStoreClient`, so metadata operations (table/partition/schema updates) can 
run reliably in Hive-on-Spark setups without requiring a fully functional 
external HMS client path.
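   
   For illustration, enabling the new path on a write might look like the sketch below. Only the `use_spark_catalog` key is introduced by this PR; the surrounding options are standard Hudi write and Hive-sync settings, and the DataFrame, table name, and path are made up:
   
   ```scala
   // Hedged sketch: a Hudi write with the Spark-catalog-backed sync path
   // enabled. `df` is assumed to be an existing DataFrame; only the
   // use_spark_catalog option is new in this PR.
   df.write.format("hudi").
     option("hoodie.table.name", "trips").
     option("hoodie.datasource.write.recordkey.field", "uuid").
     option("hoodie.datasource.write.partitionpath.field", "region").
     option("hoodie.datasource.hive_sync.enable", "true").
     option("hoodie.datasource.hive_sync.database", "default").
     option("hoodie.datasource.hive_sync.table", "trips").
     option("hoodie.datasource.hive_sync.partition_fields", "region").
     option("hoodie.datasource.hive_sync.use_spark_catalog", "true").
     mode("append").
     save("/tmp/hudi/trips")
   ```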
   
   ### Summary and Changelog
   
   - Added `SparkCatalogMetaStoreClient`, which implements `IMetaStoreClient` on top of Spark external catalog APIs for the supported operations (the delegation idea is sketched after this list).
   - Added `hoodie.datasource.hive_sync.use_spark_catalog` config and wired it 
through Hive sync config plumbing.
   - Updated `HoodieHiveSyncClient` to instantiate the Spark-catalog metastore 
client when the new config is enabled.
   - Added end-to-end Spark catalog sync tests in `TestSparkCatalogSync` for:
     - initial table and partition registration,
     - new partition registration after append writes,
     - partition drop visibility,
     - schema evolution visibility in catalog.
   - Included follow-up fixes to keep the Spark-catalog client compatible with Hive sync metadata updates in both test and runtime flows.
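   
   For intuition, here is a minimal sketch of the delegation idea behind `SparkCatalogMetaStoreClient`: the metadata operations Hive sync relies on are routed through Spark's `ExternalCatalog` instead of a Thrift-backed HMS client. The class and method names below are illustrative only and are not the actual surface of the new client (the real client implements the much larger `IMetaStoreClient` interface), and `ExternalCatalog` is an internal Spark API:
   
   ```scala
   import java.net.URI
   
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.catalyst.catalog.{CatalogStorageFormat, CatalogTablePartition}
   import org.apache.spark.sql.catalyst.catalog.CatalogTypes.TablePartitionSpec
   
   // Illustrative-only sketch (hypothetical names): route the metadata
   // operations Hive sync needs through Spark's external catalog, avoiding
   // a direct Hive metastore/Thrift client.
   class CatalogBackedSyncOps(spark: SparkSession) {
     // Internal Spark API; the PR's real client wraps this kind of access
     // behind the IMetaStoreClient interface.
     private val catalog = spark.sharedState.externalCatalog
   
     def tableExists(db: String, table: String): Boolean =
       catalog.tableExists(db, table)
   
     // Register partitions discovered after an append write.
     def addPartitions(db: String, table: String, basePath: String,
                       specs: Seq[TablePartitionSpec]): Unit = {
       val parts = specs.map { spec =>
         // Hive-style key=value path mapping; simplistic for the sketch.
         val relPath = spec.map { case (k, v) => s"$k=$v" }.mkString("/")
         CatalogTablePartition(
           spec,
           CatalogStorageFormat.empty.copy(locationUri = Some(new URI(s"$basePath/$relPath"))))
       }
       catalog.createPartitions(db, table, parts, ignoreIfExists = true)
     }
   
     // Drop partitions that no longer exist in the table.
     def dropPartitions(db: String, table: String, specs: Seq[TablePartitionSpec]): Unit =
       catalog.dropPartitions(db, table, specs,
         ignoreIfNotExists = true, purge = false, retainData = true)
   }
   ```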
   
   ### Impact
   
   - New optional behavior gated by 
`hoodie.datasource.hive_sync.use_spark_catalog` (default remains unchanged).
   - Improves the reliability of Hive sync in Spark environments where direct HMS/Thrift dependencies are unavailable or fragile (see the example below).
   - No behavior change for existing users unless the new config is explicitly 
enabled.
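   
   Once the flag is enabled, the effect can be observed directly through the Spark catalog, for example in a test or shell (table name illustrative, carried over from the earlier sketch):
   
   ```scala
   // Hedged example: with the new flag enabled, partition and schema changes
   // synced by Hudi should be visible through the Spark catalog itself.
   spark.sql("SHOW PARTITIONS default.trips").show(truncate = false)
   spark.sql("DESCRIBE TABLE default.trips").show(truncate = false)
   ```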
   
   ### Risk Level
   
   Low.
   
   The new path is opt-in and covered by end-to-end tests for the partition lifecycle and schema evolution. The default sync path is unchanged.
   
   ### Documentation Update
   
   Config-level documentation is included in code for the new 
`hoodie.datasource.hive_sync.use_spark_catalog` option. No additional 
website/doc update is required for this internal sync-path enhancement.
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Enough context is provided in the sections above
   - [x] Adequate tests were added if applicable
   
   Made with [Cursor](https://cursor.com)

