nsivabalan opened a new pull request, #18986: URL: https://github.com/apache/hudi/pull/18986
## Summary - Adds opt-in `JDBCConnectionPool` for JDBC partition sync (issue [#18331](https://github.com/apache/hudi/issues/18331)) - Fans batched ALTER statements across N pooled `java.sql.Connection` instances - Default off — existing behavior unchanged unless `hoodie.datasource.hive_sync.batching.enabled=true` **Note: stacked on [#18984](https://github.com/apache/hudi/pull/18984) (which is stacked on [#18983](https://github.com/apache/hudi/pull/18983)).** The diff currently includes both upstream commits. Once #18983 and #18984 merge, this PR will rebase cleanly onto master and the diff will shrink to just the JDBC change (~581 added / 3 removed). Reviewing the top commit `f2b079109192` in isolation gives the JDBC-only delta. ## Design constraint JDBC `Connection` is not thread-safe but is *cheap to construct* (one TCP socket to HiveServer2). So the design mirrors [#18983](https://github.com/apache/hudi/pull/18983)'s `IMetaStoreClientPool` more than #18984's `HiveDriverPool` — simple borrow/return with no thread-binding. Each pooled connection is its own HiveServer2 session. ## What's actually new The batching gaps for JDBC were small: - **ADD** — already batched in `QueryBasedDDLExecutor.constructAddPartitions` (parent class). - **DROP** — already batched in `JDBCExecutor.constructDropPartitions`. - **TOUCH** — became batched in #18984's `constructPartitionAlterStatements` change. - **SET_LOCATION (UPDATE)** — one statement per partition. Hive SQL cannot batch this (no multi-partition `SET LOCATION` syntax). So the win for JDBC is **pure parallel execution** of the existing batches and statements, not new batching. ## Hive 2.x quirk handling Same Hive 2.x `USE database` quirk we encountered in #18984: `ALTER PARTITION SET LOCATION` ignores `db.tbl` qualifiers and silently uses the connection's current database. Each pooled connection is its own HiveServer2 session, so each one needs its own `USE database` before any partition ALTER. The pool exposes `runOnEachConnection(List<String>)` for this; `JDBCExecutor.runSQLs` peels off any leading `USE` statements from the SQL list and broadcasts them to every pooled connection, then fans the rest round-robin. This is done lazily on first dispatch (not at pool construction) because the database may not yet exist when the pool is built — `HoodieHiveSyncClient` constructs the pool *before* `HiveSyncTool.syncHoodieTable()` runs `createDatabase`. (I learned this the hard way during testing.) ## Design invariant (same as #18983 / #18984) - Pool clients only do partition-row statements. - The session Connection on `JDBCExecutor` continues to handle table-row work: `createDatabase`, `createTable`, `updateTableComments`, schema evolution. - `JDBCBasedMetadataOperator` (used as the Thrift-incompatibility fallback) continues to use the same session Connection — unchanged. ## Configs No new configs — reuses everything from #18983: | Key | Default | |---|---| | `hoodie.datasource.hive_sync.batching.enabled` | `false` | | `hoodie.datasource.hive_sync.batching.threads` | `4` | | `hoodie.datasource.hive_sync.batch_num` | `1000` | ## Test plan - [x] `mvn compile` on `hudi-sync/hudi-hive-sync` — clean, 0 Checkstyle violations, 0 RAT issues - [x] `mvn test` on `hudi-sync/hudi-hive-sync` — **314 tests, 0 failures, 0 errors** - [x] `TestJDBCConnectionPool` (new, 8 tests) — borrow/return, concurrent-borrow bounding, idempotent close, exhaustion blocking, size validation, executor lifecycle - [x] `TestHiveSyncTool#testJDBCSyncWithBatchingEnabled` (new) — end-to-end JDBC sync with batching on against the embedded HiveServer2 (10 + 4 partitions, batch_num=3, threads=3) - [x] Existing 305 tests across all three sync modes pass unchanged - [ ] Manual benchmark on a ~1k-partition table (planned before flipping default; not blocking) ## Files touched (top commit only) - `HoodieHiveSyncClient.java` — build pool for JDBC mode (explicit + legacy `HIVE_USE_JDBC=true` default) - `JDBCExecutor.java` — new constructor accepting `JDBCConnectionPool`; `runSQLs` peels off `USE` (broadcasts to all connections) and dispatches the rest via pool - `util/JDBCConnectionPool.java` — **new**, ~210 lines; borrow/return queue + executor sized to pool, plus `runOnEachConnection` broadcast helper - `util/TestJDBCConnectionPool.java` — **new**; 8 unit tests with mocked `Connection` - `TestHiveSyncTool.java` — 1 new end-to-end test method ## Stack so far - [#18983](https://github.com/apache/hudi/pull/18983) — HMS pool - [#18984](https://github.com/apache/hudi/pull/18984) — HiveQL pool (stacked on #18983) - **#18985** (this PR) — JDBC pool (stacked on #18984) After all three merge, the perf gap reported in #18331 should be closed for all sync modes. Related: #18331 🤖 Generated with [Claude Code](https://claude.com/claude-code) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
