shangxinli opened a new pull request, #18764:
URL: https://github.com/apache/hudi/pull/18764

   ### Describe the issue this Pull Request addresses
   
   Phase 1 of #18750 — migrating `HoodieStreamerWriteStatusValidator` (HSWSV) 
into the pre-commit validator framework (#18068, #18362, #18405).
   
   This is the first of 4 phases. It is purely additive: it adds a new opt-in 
validator and changes nothing about the existing HSWSV path.
   
   ### Summary and Changelog
   
   **Summary:**
   Adds `SparkWriteErrorValidator`, a pure pre-commit validator that checks
   whether any records failed to write. Under the `FAIL` policy it fails the
   commit; under `WARN_LOG` the commit is allowed to proceed. This is a
   behavior-equivalent extract of HSWSV's boolean error check
   (`hasErrorRecords = totalErroredRecords > 0`) without any of HSWSV's side
   effects: no error-table commit, no top-100 error logging, no instant
   rollback. Those concerns will be extracted in subsequent phases.
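
The check described above can be sketched in isolation. This is a minimal illustration, not the code in this PR: `CommitTotals`, `FailurePolicy`, `ValidationFailedException`, and `totals(...)` are hypothetical stand-ins, and the `long` return types of `getTotalWriteErrors()` and `getTotalRecordsWritten()` are assumed.

```java
// Illustrative stand-ins only; none of these types are the actual Hudi classes.
final class WriteErrorCheckSketch {

  /** Hypothetical enum mirroring the FAIL / WARN_LOG policy values. */
  enum FailurePolicy { FAIL, WARN_LOG }

  /** Hypothetical slice of the validation context; return types are assumed. */
  interface CommitTotals {
    long getTotalWriteErrors();
    long getTotalRecordsWritten();
  }

  /** Hypothetical exception raised when the policy is FAIL and errors exist. */
  static final class ValidationFailedException extends RuntimeException {
    ValidationFailedException(String message) {
      super(message);
    }
  }

  /** Test helper: builds fixed totals without touching any write client. */
  static CommitTotals totals(final long errors, final long written) {
    return new CommitTotals() {
      @Override public long getTotalWriteErrors() { return errors; }
      @Override public long getTotalRecordsWritten() { return written; }
    };
  }

  /**
   * Returns whether the commit contains write errors.
   * Throws under FAIL; only logs under WARN_LOG. No other side effects.
   */
  static boolean validate(CommitTotals totals, FailurePolicy policy) {
    long errors = totals.getTotalWriteErrors();
    boolean hasErrorRecords = errors > 0;
    if (!hasErrorRecords) {
      return false;
    }
    String message = String.format(
        "Commit has %d write errors (%d records written)",
        errors, totals.getTotalRecordsWritten());
    if (policy == FailurePolicy.FAIL) {
      throw new ValidationFailedException(message);
    }
    System.err.println("WARN: " + message);
    return true;
  }
}
```

The sketch keeps the check free of side effects, which matches the stated intent of leaving error-table commits, error logging, and rollback to later phases.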
   
   Behavior mapping from HSWSV:
   - `cfg.commitOnErrors = false` (default) ↔ `hoodie.precommit.validators.failure.policy = FAIL`
   - `cfg.commitOnErrors = true` ↔ `hoodie.precommit.validators.failure.policy = WARN_LOG`
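
For anyone translating an existing HSWSV setup by hand, the mapping above amounts to the following. This is a sketch only: `PolicyMappingSketch` and its methods are hypothetical helpers that do not exist in this PR or in Hudi; only the two property keys, the policy values, and the validator class name come from this description.

```java
// Hypothetical helper for illustration; not part of Hudi or of this PR.
final class PolicyMappingSketch {

  static final String VALIDATORS_KEY = "hoodie.precommit.validators";
  static final String POLICY_KEY = "hoodie.precommit.validators.failure.policy";
  static final String VALIDATOR_CLASS =
      "org.apache.hudi.utilities.streamer.validator.SparkWriteErrorValidator";

  /** commitOnErrors=false maps to FAIL; commitOnErrors=true maps to WARN_LOG. */
  static String policyFor(boolean commitOnErrors) {
    return commitOnErrors ? "WARN_LOG" : "FAIL";
  }

  /** Builds the two properties needed to opt in to the new validator. */
  static java.util.Properties toValidatorProps(boolean commitOnErrors) {
    java.util.Properties props = new java.util.Properties();
    props.setProperty(VALIDATORS_KEY, VALIDATOR_CLASS);
    props.setProperty(POLICY_KEY, policyFor(commitOnErrors));
    return props;
  }
}
```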
   
   **Stack context:**
   This PR builds on #18405 (Phase 3 of the pre-commit validator framework, not
   Phase 3 of the rollout listed below), which introduced the
   Spark/HoodieStreamer pre-commit validator wiring in
   `StreamSync.writeToSinkAndDoMetaSync()`. No changes to `StreamSync` are
   needed in Phase 1: the validator plugs into the existing
   `SparkStreamerValidatorUtils.runValidators()` call when the user opts in via
   `hoodie.precommit.validators`.
   
   **Changelog:**
   - Added `SparkWriteErrorValidator extends BasePreCommitValidator` in 
`org.apache.hudi.utilities.streamer.validator`
   - Reads `getTotalWriteErrors()` and `getTotalRecordsWritten()` from 
`ValidationContext` (no new context methods needed)
   - Reuses existing `hoodie.precommit.validators.failure.policy` (FAIL / 
WARN_LOG); no new config introduced
   - Added `TestSparkWriteErrorValidator` with 8 unit tests
   - All code is new; no existing code was copied
   
   **Phased rollout (tracked in #18750):**
   - Phase 1 (this PR): Additive `SparkWriteErrorValidator`. HSWSV unchanged.
   - Phase 2: Carve out HSWSV's side effects into `ErrorTableCommitter`, 
`WriteErrorReporter`, `SuccessfulRecordCounter` helpers. HSWSV becomes a thin 
coordinator.
   - Phase 3: Flip the call site in `StreamSync` from the 
`WriteStatusValidator` callback to the pre-commit framework.
   - Phase 4: Delete HSWSV. Remove the `WriteStatusValidator` hook from the 
write client if no other caller exists.
   
   ### Impact
   
   **Public API Changes:**
   - New public class `org.apache.hudi.utilities.streamer.validator.SparkWriteErrorValidator`
   
   **User-Facing Changes:**
   Users can now enable the framework-based write-error check in HoodieStreamer 
pipelines by configuring:
   ```
   hoodie.precommit.validators=org.apache.hudi.utilities.streamer.validator.SparkWriteErrorValidator
   # Use WARN_LOG instead of FAIL to allow commits with errors
   hoodie.precommit.validators.failure.policy=FAIL
   ```
   
   This runs *alongside* HSWSV in Phase 1, not in place of it. HSWSV remains 
the canonical path that commits the error table and rolls back on failure. 
Enabling this validator adds a pre-commit guard but does not remove or replace 
any existing behavior.
   
   **Performance Impact:**
   None for users who do not configure `hoodie.precommit.validators`. When 
configured, the validator runs once per commit and only inspects 
already-collected `HoodieWriteStat` aggregates — no additional Spark actions or 
DAG evaluations.
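
To illustrate why no extra Spark work is needed, the totals can be derived with plain in-memory arithmetic over per-partition stats that the write has already produced. The sketch below is an assumption-laden stand-in: `PartitionStat` is a hypothetical type loosely modeled on the idea of `HoodieWriteStat`, and its field names are illustrative rather than the real accessors.

```java
// Hypothetical stand-in types; this is not how the PR's code is structured.
final class WriteStatSumSketch {

  /** Illustrative per-partition stat; field names are assumptions. */
  static final class PartitionStat {
    final String partitionPath;
    final long numWrites;
    final long totalWriteErrors;

    PartitionStat(String partitionPath, long numWrites, long totalWriteErrors) {
      this.partitionPath = partitionPath;
      this.numWrites = numWrites;
      this.totalWriteErrors = totalWriteErrors;
    }
  }

  /** Sums write errors across partitions; a simple loop, no distributed job. */
  static long totalWriteErrors(java.util.List<PartitionStat> stats) {
    long total = 0;
    for (PartitionStat stat : stats) {
      total += stat.totalWriteErrors;
    }
    return total;
  }

  /** Sums successfully written records across partitions. */
  static long totalRecordsWritten(java.util.List<PartitionStat> stats) {
    long total = 0;
    for (PartitionStat stat : stats) {
      total += stat.numWrites;
    }
    return total;
  }
}
```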
   
   ### Risk Level
   
   **Risk Level: low**
   
   **Justification:**
   - Purely additive — no existing code path is modified
   - Validator is opt-in; default behavior is unchanged
   - Reuses battle-tested framework components (`BasePreCommitValidator`,
   `ValidationContext`, and `SparkStreamerValidatorUtils` from #18405)
   - 8 unit tests covering all branches (no errors, FAIL/WARN_LOG, empty 
commit, missing write stats, multi-partition summing, update records, default 
policy)
   
   **Verification:**
   - `mvn -pl hudi-utilities -am test-compile`: BUILD SUCCESS
   - `mvn -pl hudi-utilities test -Dtest=TestSparkWriteErrorValidator`: 8/8 pass
   - `mvn -pl hudi-utilities test -Dtest=TestSparkKafkaOffsetValidator,TestSparkValidationContext,TestSparkStreamerValidatorUtils,TestSparkWriteErrorValidator`: 44/44 pass (no regression in other validator tests)
   - `mvn -pl hudi-utilities checkstyle:check`: 0 violations
   - `mvn -pl hudi-utilities apache-rat:check`: 0 unapproved licenses
   
   ### Documentation Update
   
   No user-facing documentation update is needed in Phase 1 — the validator is 
opt-in and the existing configuration documentation in 
`HoodiePreCommitValidatorConfig` already covers the failure-policy property. 
The `VALIDATOR_CLASS_NAMES` documentation will be updated in Phase 4 (cleanup) 
to reference `SparkWriteErrorValidator` once HSWSV is removed and this becomes 
the canonical write-error path.
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Enough context is provided in the sections above
   - [x] Adequate tests were added if applicable
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

