nsivabalan opened a new pull request, #18984: URL: https://github.com/apache/hudi/pull/18984
## Summary - Adds opt-in `HiveDriverPool` for HiveQL partition sync (issue [#18331](https://github.com/apache/hudi/issues/18331)) - Batches TOUCH using existing `hoodie.datasource.hive_sync.batch_num` (was one giant statement) - Fans batched ALTER statements across N pooled Hive `Driver` workers via dedicated single-thread executors - Default off — existing behavior unchanged unless `hoodie.datasource.hive_sync.batching.enabled=true` **Note: stacked on [#18983](https://github.com/apache/hudi/pull/18983).** The diff currently includes the HMS commit. Once #18983 merges, this PR will rebase cleanly onto master and the diff will shrink to just the HiveQL change (~664 added / 16 removed). Reviewing the top commit `49a70050c5ff` in isolation gives the HiveQL-only delta. ## Design constraint Hive's `Driver` and `SessionState` are thread-bound — `SessionState.start()` attaches to the calling thread's `ThreadLocal`, and a Driver constructed on one thread cannot be safely used from another. This is the opposite of `IMetaStoreClient` from #18983, which is a Thrift socket we can pool freely. The `HiveDriverPool` gives each slot its own dedicated worker thread (a `newSingleThreadExecutor`). Bootstrap, dispatch, and close all run on that bound thread. The `SessionState` itself is shared across workers (lazily constructed once) — each worker calls `SessionState.start(sharedState)` on its own thread to attach to its `ThreadLocal`. Constructing one SessionState per worker triggered races in Hive's resource-dir machinery during early testing. ## What this PR fixes 1. **TOUCH batching** — `QueryBasedDDLExecutor.constructPartitionAlterStatements` was concatenating every partition into one `ALTER TABLE ... TOUCH PARTITION (p1) PARTITION (p2) ...` statement. Now split into batches of `HIVE_BATCH_SYNC_PARTITION_NUM`. 2. **HiveQL parallelism** — `HiveQueryDDLExecutor.updateHiveSQLs` ran the SQL list in a single `for` loop on one Driver. With the pool, each batch is dispatched to a worker (round-robin) and they execute in parallel. `SET_LOCATION` (UPDATE) remains one statement per partition because Hive SQL doesn't support multi-partition `SET LOCATION`. But it now benefits from the parallel fan-out. ## Hive 2.x quirk worth flagging `ALTER PARTITION SET LOCATION` in Hive 2.x ignores `db.tbl` qualifiers and silently uses the connection's current database. The leading `USE database` statement in the SQL list is therefore load-bearing — when the pool is in use, it peels off any leading `USE` statements via `runOnEachWorker()` and runs them on every worker before fanning the rest out. JDBC mode (which shares one Connection) preserves today's contract where `USE` persists for subsequent statements on the same Connection. (I initially tried qualifying every statement with `\`db\`.\`tbl\`` and dropping the `USE`; that broke JDBC tests, so reverted.) ## Configs No new configs — reuses everything from #18983: | Key | Default | |---|---| | `hoodie.datasource.hive_sync.batching.enabled` | `false` | | `hoodie.datasource.hive_sync.batching.threads` | `4` | | `hoodie.datasource.hive_sync.batch_num` | `1000` | ## Test plan - [x] `mvn compile` on `hudi-sync/hudi-hive-sync` — clean, 0 Checkstyle violations, 0 RAT issues - [x] `mvn test` on `hudi-sync/hudi-hive-sync` — **305 tests, 0 failures, 0 errors** - [x] `TestHiveDriverPool` (new, 7 tests) — bootstrap, dispatch round-robin, error propagation, concurrent-borrow bounding, close idempotency - [x] `TestHiveSyncTool#testHiveQLSyncWithBatchingEnabled` (new) — end-to-end HiveQL sync with batching on, parametrized batch_num=3 threads=3 against 10 + 4 partitions - [x] `TestHiveSyncTool#testHiveQLTouchPartitionsWithBatching` (new) — exercises the batched TOUCH path - [x] Existing 296 tests across all three sync modes (hms / hiveql / jdbc) — pass unchanged - [ ] Manual benchmark on a ~1k-partition table (planned before flipping default; not blocking) ## Files touched (top commit only) - `QueryBasedDDLExecutor.java` — batch TOUCH; new `runSQLs(List<String>)` hook for parallel-friendly subclasses - `HiveQueryDDLExecutor.java` — new constructor accepting `HiveDriverPool`; `runSQLs` peels off `USE` and dispatches via pool - `HoodieHiveSyncClient.java` — build pool for HIVEQL mode (explicit + legacy default) - `util/HiveDriverPool.java` — **new**, ~250 lines; eager pool of single-thread executors, each owning a Driver bound to a shared SessionState - `util/TestHiveDriverPool.java` — **new**; 7 unit tests with mocked `DriverFactory` - `TestHiveSyncTool.java` — 2 new end-to-end test methods ## Follow-ups (separate PRs) 1. JDBC executor parallelism — needs a JDBC `Connection` pool, different concerns from `Driver`/`SessionState`. 2. After benchmarks across HMS + HiveQL, post combined numbers on #18331 and decide whether default-off is still right. Related: #18331 🤖 Generated with [Claude Code](https://claude.com/claude-code) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
