[PR] feat(hive-sync): batch and parallelize JDBC partition operations [hudi]

via GitHub Thu, 11 Jun 2026 21:02:11 -0700


nsivabalan opened a new pull request, #18986:
URL: https://github.com/apache/hudi/pull/18986


   ## Summary
   - Adds opt-in `JDBCConnectionPool` for JDBC partition sync (issue 
[#18331](https://github.com/apache/hudi/issues/18331))
   - Fans batched ALTER statements across N pooled `java.sql.Connection` 
instances
   - Default off — existing behavior unchanged unless 
`hoodie.datasource.hive_sync.batching.enabled=true`
   
   **Note: stacked on [#18984](https://github.com/apache/hudi/pull/18984) 
(which is stacked on [#18983](https://github.com/apache/hudi/pull/18983)).** 
The diff currently includes both upstream commits. Once #18983 and #18984 
merge, this PR will rebase cleanly onto master and the diff will shrink to just 
the JDBC change (~581 added / 3 removed). Reviewing the top commit 
`f2b079109192` in isolation gives the JDBC-only delta.
   
   ## Design constraint
   JDBC `Connection` is not thread-safe but is *cheap to construct* (one TCP 
socket to HiveServer2). So the design mirrors 
[#18983](https://github.com/apache/hudi/pull/18983)'s `IMetaStoreClientPool` 
more than #18984's `HiveDriverPool` — simple borrow/return with no 
thread-binding. Each pooled connection is its own HiveServer2 session.
   
   ## What's actually new
   The batching gaps for JDBC were small:
   - **ADD** — already batched in 
`QueryBasedDDLExecutor.constructAddPartitions` (parent class).
   - **DROP** — already batched in `JDBCExecutor.constructDropPartitions`.
   - **TOUCH** — became batched in #18984's `constructPartitionAlterStatements` 
change.
   - **SET_LOCATION (UPDATE)** — one statement per partition. Hive SQL cannot 
batch this (no multi-partition `SET LOCATION` syntax).
   
   So the win for JDBC is **pure parallel execution** of the existing batches 
and statements, not new batching.
   
   ## Hive 2.x quirk handling
   Same Hive 2.x `USE database` quirk we encountered in #18984: `ALTER 
PARTITION SET LOCATION` ignores `db.tbl` qualifiers and silently uses the 
connection's current database. Each pooled connection is its own HiveServer2 
session, so each one needs its own `USE database` before any partition ALTER. 
The pool exposes `runOnEachConnection(List<String>)` for this; 
`JDBCExecutor.runSQLs` peels off any leading `USE` statements from the SQL list 
and broadcasts them to every pooled connection, then fans the rest round-robin.
   
   This is done lazily on first dispatch (not at pool construction) because the 
database may not yet exist when the pool is built — `HoodieHiveSyncClient` 
constructs the pool *before* `HiveSyncTool.syncHoodieTable()` runs 
`createDatabase`. (I learned this the hard way during testing.)
   
   ## Design invariant (same as #18983 / #18984)
   - Pool clients only do partition-row statements.
   - The session Connection on `JDBCExecutor` continues to handle table-row 
work: `createDatabase`, `createTable`, `updateTableComments`, schema evolution.
   - `JDBCBasedMetadataOperator` (used as the Thrift-incompatibility fallback) 
continues to use the same session Connection — unchanged.
   
   ## Configs
   No new configs — reuses everything from #18983:
   | Key | Default |
   |---|---|
   | `hoodie.datasource.hive_sync.batching.enabled` | `false` |
   | `hoodie.datasource.hive_sync.batching.threads` | `4` |
   | `hoodie.datasource.hive_sync.batch_num` | `1000` |
   
   ## Test plan
   - [x] `mvn compile` on `hudi-sync/hudi-hive-sync` — clean, 0 Checkstyle 
violations, 0 RAT issues
   - [x] `mvn test` on `hudi-sync/hudi-hive-sync` — **314 tests, 0 failures, 0 
errors**
   - [x] `TestJDBCConnectionPool` (new, 8 tests) — borrow/return, 
concurrent-borrow bounding, idempotent close, exhaustion blocking, size 
validation, executor lifecycle
   - [x] `TestHiveSyncTool#testJDBCSyncWithBatchingEnabled` (new) — end-to-end 
JDBC sync with batching on against the embedded HiveServer2 (10 + 4 partitions, 
batch_num=3, threads=3)
   - [x] Existing 305 tests across all three sync modes pass unchanged
   - [ ] Manual benchmark on a ~1k-partition table (planned before flipping 
default; not blocking)
   
   ## Files touched (top commit only)
   - `HoodieHiveSyncClient.java` — build pool for JDBC mode (explicit + legacy 
`HIVE_USE_JDBC=true` default)
   - `JDBCExecutor.java` — new constructor accepting `JDBCConnectionPool`; 
`runSQLs` peels off `USE` (broadcasts to all connections) and dispatches the 
rest via pool
   - `util/JDBCConnectionPool.java` — **new**, ~210 lines; borrow/return queue 
+ executor sized to pool, plus `runOnEachConnection` broadcast helper
   - `util/TestJDBCConnectionPool.java` — **new**; 8 unit tests with mocked 
`Connection`
   - `TestHiveSyncTool.java` — 1 new end-to-end test method
   
   ## Stack so far
   - [#18983](https://github.com/apache/hudi/pull/18983) — HMS pool
   - [#18984](https://github.com/apache/hudi/pull/18984) — HiveQL pool (stacked 
on #18983)
   - **#18985** (this PR) — JDBC pool (stacked on #18984)
   
   After all three merge, the perf gap reported in #18331 should be closed for 
all sync modes.
   
   Related: #18331
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[PR] feat(hive-sync): batch and parallelize JDBC partition operations [hudi]

Reply via email to