nsivabalan opened a new pull request, #18984:
URL: https://github.com/apache/hudi/pull/18984

   ## Summary
   - Adds opt-in `HiveDriverPool` for HiveQL partition sync (issue 
[#18331](https://github.com/apache/hudi/issues/18331))
   - Batches TOUCH using existing `hoodie.datasource.hive_sync.batch_num` (was 
one giant statement)
   - Fans batched ALTER statements across N pooled Hive `Driver` workers via 
dedicated single-thread executors
   - Default off — existing behavior unchanged unless 
`hoodie.datasource.hive_sync.batching.enabled=true`
   
   **Note: stacked on [#18983](https://github.com/apache/hudi/pull/18983).** 
The diff currently includes the HMS commit. Once #18983 merges, this PR will 
rebase cleanly onto master and the diff will shrink to just the HiveQL change 
(~664 added / 16 removed). Reviewing the top commit `49a70050c5ff` in isolation 
gives the HiveQL-only delta.
   
   ## Design constraint
   Hive's `Driver` and `SessionState` are thread-bound — `SessionState.start()` 
attaches to the calling thread's `ThreadLocal`, and a Driver constructed on one 
thread cannot be safely used from another. This is the opposite of 
`IMetaStoreClient` from #18983, which is a Thrift socket we can pool freely.
   
   The `HiveDriverPool` gives each slot its own dedicated worker thread (a 
`newSingleThreadExecutor`). Bootstrap, dispatch, and close all run on that 
bound thread. The `SessionState` itself is shared across workers (lazily 
constructed once) — each worker calls `SessionState.start(sharedState)` on its 
own thread to attach to its `ThreadLocal`. Constructing one SessionState per 
worker triggered races in Hive's resource-dir machinery during early testing.
   
   ## What this PR fixes
   
   1. **TOUCH batching** — 
`QueryBasedDDLExecutor.constructPartitionAlterStatements` was concatenating 
every partition into one `ALTER TABLE ... TOUCH PARTITION (p1) PARTITION (p2) 
...` statement. Now split into batches of `HIVE_BATCH_SYNC_PARTITION_NUM`.
   2. **HiveQL parallelism** — `HiveQueryDDLExecutor.updateHiveSQLs` ran the 
SQL list in a single `for` loop on one Driver. With the pool, each batch is 
dispatched to a worker (round-robin) and they execute in parallel.
   
   `SET_LOCATION` (UPDATE) remains one statement per partition because Hive SQL 
doesn't support multi-partition `SET LOCATION`. But it now benefits from the 
parallel fan-out.
   
   ## Hive 2.x quirk worth flagging
   `ALTER PARTITION SET LOCATION` in Hive 2.x ignores `db.tbl` qualifiers and 
silently uses the connection's current database. The leading `USE database` 
statement in the SQL list is therefore load-bearing — when the pool is in use, 
it peels off any leading `USE` statements via `runOnEachWorker()` and runs them 
on every worker before fanning the rest out. JDBC mode (which shares one 
Connection) preserves today's contract where `USE` persists for subsequent 
statements on the same Connection. (I initially tried qualifying every 
statement with `\`db\`.\`tbl\`` and dropping the `USE`; that broke JDBC tests, 
so reverted.)
   
   ## Configs
   No new configs — reuses everything from #18983:
   | Key | Default |
   |---|---|
   | `hoodie.datasource.hive_sync.batching.enabled` | `false` |
   | `hoodie.datasource.hive_sync.batching.threads` | `4` |
   | `hoodie.datasource.hive_sync.batch_num` | `1000` |
   
   ## Test plan
   - [x] `mvn compile` on `hudi-sync/hudi-hive-sync` — clean, 0 Checkstyle 
violations, 0 RAT issues
   - [x] `mvn test` on `hudi-sync/hudi-hive-sync` — **305 tests, 0 failures, 0 
errors**
   - [x] `TestHiveDriverPool` (new, 7 tests) — bootstrap, dispatch round-robin, 
error propagation, concurrent-borrow bounding, close idempotency
   - [x] `TestHiveSyncTool#testHiveQLSyncWithBatchingEnabled` (new) — 
end-to-end HiveQL sync with batching on, parametrized batch_num=3 threads=3 
against 10 + 4 partitions
   - [x] `TestHiveSyncTool#testHiveQLTouchPartitionsWithBatching` (new) — 
exercises the batched TOUCH path
   - [x] Existing 296 tests across all three sync modes (hms / hiveql / jdbc) — 
pass unchanged
   - [ ] Manual benchmark on a ~1k-partition table (planned before flipping 
default; not blocking)
   
   ## Files touched (top commit only)
   - `QueryBasedDDLExecutor.java` — batch TOUCH; new `runSQLs(List<String>)` 
hook for parallel-friendly subclasses
   - `HiveQueryDDLExecutor.java` — new constructor accepting `HiveDriverPool`; 
`runSQLs` peels off `USE` and dispatches via pool
   - `HoodieHiveSyncClient.java` — build pool for HIVEQL mode (explicit + 
legacy default)
   - `util/HiveDriverPool.java` — **new**, ~250 lines; eager pool of 
single-thread executors, each owning a Driver bound to a shared SessionState
   - `util/TestHiveDriverPool.java` — **new**; 7 unit tests with mocked 
`DriverFactory`
   - `TestHiveSyncTool.java` — 2 new end-to-end test methods
   
   ## Follow-ups (separate PRs)
   1. JDBC executor parallelism — needs a JDBC `Connection` pool, different 
concerns from `Driver`/`SessionState`.
   2. After benchmarks across HMS + HiveQL, post combined numbers on #18331 and 
decide whether default-off is still right.
   
   Related: #18331
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to