shangxinli opened a new issue, #18750: URL: https://github.com/apache/hudi/issues/18750
## Background Follow-up to #18405 ([Danny's review comment](https://github.com/apache/hudi/pull/18405#discussion_r3214248837)). #18405 introduced a Spark/HoodieStreamer pre-commit validator (`SparkKafkaOffsetValidator`) built on the framework from #18068. We now have two parallel write-error / write-status validation paths in HoodieStreamer: 1. The new `BasePreCommitValidator` framework (#18068, #18405) — runs *before* `writeClient.commit()`, parameterized by `TypedProperties`, state exposed via `ValidationContext`. 2. The existing `HoodieStreamerWriteStatusValidator` (`StreamSync.java:1403`) — runs *inside* `writeClient.commit()` via the `WriteStatusValidator` callback hook. Danny suggested consolidating these by migrating (2) into the (1) framework. This issue tracks that work. ## Why it isn't a trivial migration `HoodieStreamerWriteStatusValidator` does substantially more than pure validation: - Counts records/errors and writes to a shared `AtomicLong totalSuccessfulRecords` consumed *after* commit (drives the `runMetaSync()` decision). - **Commits the error table** as a side effect, with `ROLLBACK_COMMIT` / `LOG_ERROR` strategies. - Rolls back the instant on validation failure; logs the top-100 errors. - Runs *inside* `writeClient.commit()`, not before it. The current `BasePreCommitValidator` / `ValidationContext` interface has no path to: - Surface the raw `JavaRDD<WriteStatus>` (needed for error-table commit). - Inject the error-table writer. - Expose a post-commit counter back to `StreamSync`. ## Proposed approach (to discuss) Option A — **Extend `ValidationContext`**: add accessors for `JavaRDD<WriteStatus>` and an error-table writer handle; keep side effects inside the validator. Option B — **Split side-effects out**: keep validation pure (just succeed/fail with reasons) and move error-table commit + counter wiring into a separate post-validation step in `StreamSync`. Leaning toward Option B because it keeps the validator framework clean and matches the "validate, don't mutate" contract of `BasePreCommitValidator`. Open to feedback. ## Acceptance criteria - [ ] Decision on Option A vs. B (or alternative). - [ ] `HoodieStreamerWriteStatusValidator` removed; equivalent behavior implemented as a pre-commit validator. - [ ] Error-table commit + `totalSuccessfulRecords` wiring preserved (no regression in `runMetaSync` decision). - [ ] Tests covering `ROLLBACK_COMMIT` and `LOG_ERROR` error policies. - [ ] No behavior change for users who don't configure `hoodie.precommit.validators`. cc @danny0405 @codope -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
