Re: [I] [IMPROVEMENT] Hive Sync partition operations lack batching and parallelism, causing 4x-9x slowdown for large tables [hudi]

via GitHub Thu, 11 Jun 2026 21:04:55 -0700


nsivabalan commented on issue #18331:
URL: https://github.com/apache/hudi/issues/18331#issuecomment-4687309498


   Hi @prashantwason — posting an update on this. I've put up a stack of three 
PRs that implement the proposed fix, one per sync mode, all gated behind a 
single opt-in flag.
   
   ## The PRs
   
   | PR | Scope |
   |---|---|
   | [#18983](https://github.com/apache/hudi/pull/18983) | HMS executor — 
`IMetaStoreClientPool` |
   | [#18984](https://github.com/apache/hudi/pull/18984) | HiveQL executor — 
`HiveDriverPool` (stacked on #18983) |
   | [#18986](https://github.com/apache/hudi/pull/18986) | JDBC executor — 
`JDBCConnectionPool` (stacked on #18984) |
   
   All three reuse the same two configs (no per-mode flags):
   
   | Key | Default |
   |---|---|
   | `hoodie.datasource.hive_sync.batching.enabled` | `false` |
   | `hoodie.datasource.hive_sync.batching.threads` | `4` |
   | `hoodie.datasource.hive_sync.batch_num` | existing, default 1000 |
   
   ## Design summary
   
   The same invariant holds across all three modes: **the pool is 
partition-only**. Table-row operations (`createTable`, `updateTableDefinition`, 
`updateLastCommitTimeSynced`, `updateHoodieWriterVersion`, 
`updateTableComments`) continue to use the existing session client/connection. 
The sync flow is `serial → parallel → serial`:
   
   1. Table setup (single client)
   2. Partition fan-out across the pool
   3. Table finalization (single client)
   
   This eliminates lost-update risk on `Table.parameters` (the 
read-modify-write pattern in `updateLastCommitTimeSynced` etc.).
   
   Per-mode implementation differences:
   
   - **HMS**: borrow/return pool of `RetryingMetaStoreClient.getProxy(...)` 
instances. Thrift client is not thread-safe but trivially poolable. Also adds 
the missing batching to TOUCH/UPDATE (`alter_partitions`) and DROP 
(`dropPartition`) — those were missing batch boundaries in the original code.
   - **HiveQL**: more involved — `Driver`/`SessionState` are *thread-bound*, 
not just non-thread-safe. The pool gives each slot its own dedicated worker 
thread; the `Driver` is constructed on that thread, lives there, and dies 
there. `SessionState` is shared across workers but `SessionState.start()` is 
called on each thread to attach to its ThreadLocal.
   - **JDBC**: simplest of the three — `Connection` is not thread-safe but 
cheap to open, so back to a plain borrow/return pool like HMS.
   
   ## One detail worth flagging from the implementation
   
   The original issue's root-cause #1 mentions JDBC's DROP as un-batched, but I 
think that's the one inaccuracy — `JDBCExecutor.constructDropPartitions` 
already batches by `HIVE_BATCH_SYNC_PARTITION_NUM`. The real JDBC batching gap 
is **UPDATE / `SET LOCATION`**, which Hive SQL fundamentally can't batch (no 
multi-partition syntax). For JDBC, the gain is therefore purely from parallel 
execution of the per-partition `SET LOCATION` statements, not from new batching.
   
   There's also a Hive 2.x quirk worth knowing about: `ALTER PARTITION ... SET 
LOCATION` ignores `db.tbl` qualifiers and silently uses the connection's 
current database. The HiveQL and JDBC pools handle this by broadcasting the 
leading `USE database` statement to every pooled worker before fanning out the 
partition statements. Took me a test cycle to figure out, so flagging it for 
future maintainers.
   
   ## What's done and what's not
   
   Done in the PRs:
   - All four operations (ADD/UPDATE/TOUCH/DROP) consistently batched by 
`HIVE_BATCH_SYNC_PARTITION_NUM` across all three executors
   - Parallel fan-out for all three modes
   - Default off, no behavior change for existing users
   - Full test suite (296 → 314 tests, all green)
   - New unit tests for each pool class (mocked clients)
   - New end-to-end integration tests for each mode (batching on, against 
embedded HMS)
   
   **Not done, and where I'd really like your help:**
   - **Production-scale benchmark.** The PR descriptions all include a TODO for 
this. The embedded HMS that the test suite uses isn't representative — relative 
speedups on a laptop don't translate cleanly to a production HMS with real 
network latency. If you (or anyone watching this issue) can run a sync against 
a 1.5k–2k partition table with `batching.enabled=false` vs `true` and post 
numbers, that would be the missing piece before any default-on consideration.
   
   Happy to iterate on any of this. The PRs are independently reviewable in 
stack order (#18983 first), and each diff is self-contained in its top commit.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [I] [IMPROVEMENT] Hive Sync partition operations lack batching and parallelism, causing 4x-9x slowdown for large tables [hudi]

Reply via email to