nsivabalan commented on issue #18331: URL: https://github.com/apache/hudi/issues/18331#issuecomment-4687309498
Hi @prashantwason — posting an update on this. I've put up a stack of three PRs that implement the proposed fix, one per sync mode, all gated behind a single opt-in flag. ## The PRs | PR | Scope | |---|---| | [#18983](https://github.com/apache/hudi/pull/18983) | HMS executor — `IMetaStoreClientPool` | | [#18984](https://github.com/apache/hudi/pull/18984) | HiveQL executor — `HiveDriverPool` (stacked on #18983) | | [#18986](https://github.com/apache/hudi/pull/18986) | JDBC executor — `JDBCConnectionPool` (stacked on #18984) | All three reuse the same two configs (no per-mode flags): | Key | Default | |---|---| | `hoodie.datasource.hive_sync.batching.enabled` | `false` | | `hoodie.datasource.hive_sync.batching.threads` | `4` | | `hoodie.datasource.hive_sync.batch_num` | existing, default 1000 | ## Design summary The same invariant holds across all three modes: **the pool is partition-only**. Table-row operations (`createTable`, `updateTableDefinition`, `updateLastCommitTimeSynced`, `updateHoodieWriterVersion`, `updateTableComments`) continue to use the existing session client/connection. The sync flow is `serial → parallel → serial`: 1. Table setup (single client) 2. Partition fan-out across the pool 3. Table finalization (single client) This eliminates lost-update risk on `Table.parameters` (the read-modify-write pattern in `updateLastCommitTimeSynced` etc.). Per-mode implementation differences: - **HMS**: borrow/return pool of `RetryingMetaStoreClient.getProxy(...)` instances. Thrift client is not thread-safe but trivially poolable. Also adds the missing batching to TOUCH/UPDATE (`alter_partitions`) and DROP (`dropPartition`) — those were missing batch boundaries in the original code. - **HiveQL**: more involved — `Driver`/`SessionState` are *thread-bound*, not just non-thread-safe. The pool gives each slot its own dedicated worker thread; the `Driver` is constructed on that thread, lives there, and dies there. `SessionState` is shared across workers but `SessionState.start()` is called on each thread to attach to its ThreadLocal. - **JDBC**: simplest of the three — `Connection` is not thread-safe but cheap to open, so back to a plain borrow/return pool like HMS. ## One detail worth flagging from the implementation The original issue's root-cause #1 mentions JDBC's DROP as un-batched, but I think that's the one inaccuracy — `JDBCExecutor.constructDropPartitions` already batches by `HIVE_BATCH_SYNC_PARTITION_NUM`. The real JDBC batching gap is **UPDATE / `SET LOCATION`**, which Hive SQL fundamentally can't batch (no multi-partition syntax). For JDBC, the gain is therefore purely from parallel execution of the per-partition `SET LOCATION` statements, not from new batching. There's also a Hive 2.x quirk worth knowing about: `ALTER PARTITION ... SET LOCATION` ignores `db.tbl` qualifiers and silently uses the connection's current database. The HiveQL and JDBC pools handle this by broadcasting the leading `USE database` statement to every pooled worker before fanning out the partition statements. Took me a test cycle to figure out, so flagging it for future maintainers. ## What's done and what's not Done in the PRs: - All four operations (ADD/UPDATE/TOUCH/DROP) consistently batched by `HIVE_BATCH_SYNC_PARTITION_NUM` across all three executors - Parallel fan-out for all three modes - Default off, no behavior change for existing users - Full test suite (296 → 314 tests, all green) - New unit tests for each pool class (mocked clients) - New end-to-end integration tests for each mode (batching on, against embedded HMS) **Not done, and where I'd really like your help:** - **Production-scale benchmark.** The PR descriptions all include a TODO for this. The embedded HMS that the test suite uses isn't representative — relative speedups on a laptop don't translate cleanly to a production HMS with real network latency. If you (or anyone watching this issue) can run a sync against a 1.5k–2k partition table with `batching.enabled=false` vs `true` and post numbers, that would be the missing piece before any default-on consideration. Happy to iterate on any of this. The PRs are independently reviewable in stack order (#18983 first), and each diff is self-contained in its top commit. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
