[PR] feat(hive-sync): batch and parallelize HMS partition operations [hudi]

via GitHub Thu, 11 Jun 2026 16:56:28 -0700


nsivabalan opened a new pull request, #18983:
URL: https://github.com/apache/hudi/pull/18983


   ## Summary
   - Adds opt-in `IMetaStoreClientPool` for HMS partition sync (issue [#18331](https://github.com/apache/hudi/issues/18331))
   - Batches ADD/TOUCH/UPDATE/DROP using the existing 
`hoodie.datasource.hive_sync.batch_num` (default 1000)
   - Fans batches out across N pooled `RetryingMetaStoreClient` instances via a 
fixed-size executor
   - Default off — existing behavior unchanged unless 
`hoodie.datasource.hive_sync.batching.enabled=true`
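
The batching and fan-out described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `BatchFanOutSketch`, `splitIntoBatches`, and `fanOut` are hypothetical names, and a plain `Consumer` stands in for the pooled metastore client call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Hypothetical sketch; names are illustrative and not taken from the PR.
final class BatchFanOutSketch {

  // Split items into consecutive batches of at most batchSize elements.
  static <T> List<List<T>> splitIntoBatches(List<T> items, int batchSize) {
    if (batchSize <= 0) {
      throw new IllegalArgumentException("batchSize must be positive");
    }
    List<List<T>> batches = new ArrayList<>();
    for (int start = 0; start < items.size(); start += batchSize) {
      int end = Math.min(start + batchSize, items.size());
      batches.add(new ArrayList<>(items.subList(start, end)));
    }
    return batches;
  }

  // Submit one task per batch to a fixed-size executor and wait for all of them.
  static <T> void fanOut(List<List<T>> batches, int threads, Consumer<List<T>> batchHandler) {
    ExecutorService executor = Executors.newFixedThreadPool(threads);
    try {
      List<Future<?>> futures = new ArrayList<>();
      for (List<T> batch : batches) {
        futures.add(executor.submit(() -> batchHandler.accept(batch)));
      }
      for (Future<?> future : futures) {
        try {
          future.get();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          throw new IllegalStateException("interrupted while waiting for a batch", e);
        } catch (ExecutionException e) {
          throw new IllegalStateException("batch failed", e.getCause());
        }
      }
    } finally {
      executor.shutdown();
    }
  }
}
```

With the values used by the new end-to-end test (`batch_num=3`, 10 partitions), `splitIntoBatches` would yield four batches of sizes 3, 3, 3, and 1.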
   
   ## Design invariant
   Only partition-row operations (`add_partitions`, `alter_partitions`, 
`dropPartition`, `getPartition`) go through the pool. Table-row operations 
(`createTable`, `alter_table`, `updateLastCommitTimeSynced`, 
`updateHoodieWriterVersion`, `updateTableComments`) stay on the existing 
session `IMetaStoreClient` held by `HoodieHiveSyncClient`. The sync flow is 
therefore **serial → parallel → serial**:
   
   1. Table setup (single client)
   2. Partition fan-out across the pool
   3. Table finalization (single client)
   
   This eliminates lost-update risk on `Table.parameters` (the 
read-modify-write pattern used by `updateLastCommitTimeSynced` and 
`updateHoodieWriterVersion`).
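
To make the lost-update risk concrete, here is a small deterministic sketch of two interleaved read-modify-write cycles on a parameter map. Everything in it is an assumption made for illustration: `FakeTable` is a stand-in and not an HMS class, and the keys `example_commit_time` and `example_writer_version` are placeholders rather than the real table property names.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the lost-update hazard; not HMS or Hudi code.
final class LostUpdateSketch {

  // Stand-in for a metastore table row that is always written as a whole map.
  static final class FakeTable {
    private Map<String, String> parameters = new HashMap<>();

    Map<String, String> readParameters() {
      return new HashMap<>(parameters); // caller gets a snapshot copy
    }

    void writeParameters(Map<String, String> updated) {
      parameters = new HashMap<>(updated); // whole-map overwrite
    }
  }

  // Both writers read before either writes, as could happen with two clients.
  static Map<String, String> interleavedUpdates() {
    FakeTable table = new FakeTable();
    Map<String, String> seenByA = table.readParameters();
    Map<String, String> seenByB = table.readParameters();
    seenByA.put("example_commit_time", "20260611");
    table.writeParameters(seenByA);
    seenByB.put("example_writer_version", "9");
    table.writeParameters(seenByB); // B's stale snapshot drops A's key
    return table.readParameters();
  }

  // The same two updates, run one after the other on a single client.
  static Map<String, String> serialUpdates() {
    FakeTable table = new FakeTable();
    Map<String, String> first = table.readParameters();
    first.put("example_commit_time", "20260611");
    table.writeParameters(first);
    Map<String, String> second = table.readParameters(); // sees the first write
    second.put("example_writer_version", "9");
    table.writeParameters(second);
    return table.readParameters();
  }
}
```

In the interleaved case only the second writer's key survives, while the serial case keeps both. Keeping table-row operations on one client, as the invariant requires, corresponds to the serial case.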
   
   ## New configs
   | Key | Default | Purpose |
   |---|---|---|
   | `hoodie.datasource.hive_sync.batching.enabled` | `false` | Master feature flag |
   | `hoodie.datasource.hive_sync.batching.threads` | `4` | Pool size + worker thread count |
   
   Reuses the existing `hoodie.datasource.hive_sync.batch_num` for batch sizing 
(no new batch-size config).
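
As a sketch of opting in, the three keys could be set together like this. The keys and defaults come from the table above; the use of a plain `java.util.Properties` object and the class name `BatchingConfigSketch` are assumptions for illustration, and how the properties reach the sync tool is not shown.

```java
import java.util.Properties;

// Hypothetical sketch; only the property keys come from this PR description.
final class BatchingConfigSketch {

  static Properties batchingProps() {
    Properties props = new Properties();
    // Opt in to the pooled, batched HMS partition sync path (off by default).
    props.setProperty("hoodie.datasource.hive_sync.batching.enabled", "true");
    // Pool size and worker thread count (default 4).
    props.setProperty("hoodie.datasource.hive_sync.batching.threads", "4");
    // Existing batch-size setting, reused for batch sizing (default 1000).
    props.setProperty("hoodie.datasource.hive_sync.batch_num", "1000");
    return props;
  }
}
```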
   
   ## Scope
   HMS executor only, matching the branch name `hms-parallelize-calls`. HiveQL and JDBC executor parallelism is deferred: per-thread `SessionState`/`Driver` handling and JDBC `Connection` pools are bigger changes, so that work is gated on benchmarks of this PR.
   
   ## Failure semantics
   - **Today (sequential):** batch N fails → N+1..K never run. Predictable 
prefix.
   - **New (parallel):** in-flight batches complete; the first exception is thrown and the remaining errors are logged at WARN. Re-running sync is already idempotent (`add_partitions(list, ifNotExists=true)`, `partitionExists` guard before `dropPartition`, idempotent `alter_partitions`), so retrying after a partial failure behaves the same as it does today.
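
The parallel failure handling could look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's `runBatches` helper: `FirstFailureSketch` and `runAll` are hypothetical names, "first" is taken to mean first in submission order, and `java.util.logging` stands in for whatever logger the project uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch; names and logging choice are assumptions.
final class FirstFailureSketch {
  private static final Logger LOG = Logger.getLogger(FirstFailureSketch.class.getName());

  // Runs every task, waits for all of them, then rethrows the first failure
  // (in submission order) and logs any further failures as warnings.
  static void runAll(List<Callable<Void>> tasks, int threads) {
    ExecutorService executor = Executors.newFixedThreadPool(threads);
    RuntimeException first = null;
    try {
      List<Future<Void>> futures = new ArrayList<>();
      for (Callable<Void> task : tasks) {
        futures.add(executor.submit(task));
      }
      for (Future<Void> future : futures) {
        try {
          future.get(); // wait for this batch even if an earlier one failed
        } catch (ExecutionException e) {
          if (first == null) {
            first = new RuntimeException("batch failed", e.getCause());
          } else {
            LOG.log(Level.WARNING, "additional batch failure", e.getCause());
          }
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          throw new IllegalStateException("interrupted while waiting for batches", e);
        }
      }
    } finally {
      executor.shutdown();
    }
    if (first != null) {
      throw first;
    }
  }
}
```

Because every future is awaited before anything is rethrown, a failure in one batch does not abandon the others, which is the difference from the sequential prefix behavior.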
   
   ## Compatibility
   When `HIVE_SYNC_USE_SPARK_CATALOG=true`, the pool path is skipped with a 
warning and we fall back to sequential — the reflection-built 
`SparkCatalogMetaStoreClient` isn't compatible with the direct 
`RetryingMetaStoreClient.getProxy` construction path.
   
   ## Test plan
   - [x] `mvn compile` on `hudi-sync/hudi-hive-sync` — clean, 0 Checkstyle 
violations, 0 RAT issues
   - [x] `mvn test` on `hudi-sync/hudi-hive-sync` — **296 tests, 0 failures, 0 
errors**
   - [x] `TestIMetaStoreClientPool` (new, 8 tests) — borrow/return on success, 
on failure, concurrent borrow bounded by pool size, idempotent close, executor 
lifecycle, size validation
   - [x] `TestHiveSyncTool#testHMSSyncWithBatchingEnabled` (new) — end-to-end 
HMS sync with `batching.enabled=true`, `threads=3`, `batch_num=3`, 10 initial + 
4 incremental partitions
   - [ ] Manual benchmark on a 2k-partition table (planned before flipping 
default; not blocking this PR since flag is opt-in)
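
For readers unfamiliar with the borrow/return behavior that the pool tests exercise, a generic bounded pool can be sketched as below. This is not the `IMetaStoreClientPool` added by the PR: `BoundedClientPoolSketch`, `run`, and `idleCount` are hypothetical names, the client type is a generic parameter, and closing the pool is left out.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;

// Hypothetical sketch of a fixed-size client pool; not the PR's class.
final class BoundedClientPoolSketch<C> {
  private final BlockingQueue<C> idle;

  BoundedClientPoolSketch(List<C> clients) {
    if (clients.isEmpty()) {
      throw new IllegalArgumentException("pool size must be at least 1");
    }
    // Capacity equals the number of clients, so concurrency is bounded by it.
    this.idle = new ArrayBlockingQueue<>(clients.size(), false, clients);
  }

  // Borrows a client, applies the action, and always returns the client.
  <R> R run(Function<C, R> action) {
    C client;
    try {
      client = idle.take(); // blocks while every client is borrowed
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while borrowing a client", e);
    }
    try {
      return action.apply(client);
    } finally {
      idle.add(client); // returned on success and on failure
    }
  }

  // Number of clients currently available to borrow.
  int idleCount() {
    return idle.size();
  }
}
```

Returning the client in a `finally` block is what keeps the pool at full size after a failed action, matching the "borrow/return on success, on failure" cases listed above.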
   
   ## Files touched
   - `HiveSyncConfigHolder.java` — 2 new `ConfigProperty` constants
   - `HoodieHiveSyncClient.java` — owns + closes the pool, builds it only for 
HMS mode with flag on
   - `ddl/HMSDDLExecutor.java` — new constructor accepting the pool, 
`runBatches` helper shared across all four partition methods
   - `util/IMetaStoreClientPool.java` — **new**; modeled on Iceberg's 
`HiveClientPool` but standalone (no Iceberg dep)
   - `TestIMetaStoreClientPool.java` — **new**; mock-based unit tests
   - `TestHiveSyncTool.java` — new end-to-end test method
   
   ## Follow-ups (separate PRs)
   1. Respond to issue #18331 noting a minor inaccuracy: JDBC's DROP is already batched ([`JDBCExecutor.constructDropPartitions`](https://github.com/apache/hudi/blob/master/hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/ddl/JDBCExecutor.java)). The real JDBC gap is `SET LOCATION` (UPDATE), which is one statement per partition with no batching.
   2. After benchmarks, decide whether HiveQL + JDBC parallelism is worth the 
additional per-thread `SessionState`/`Connection` complexity.
   
   Related: #18331
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


