shangxinli opened a new pull request, #18068:
URL: https://github.com/apache/hudi/pull/18068
Implement an engine-agnostic pre-commit validation framework in hudi-common
that lets validators access commit metadata, the timeline, and write
statistics across all write engines (Spark, Flink, Java).
This is Phase 1 of a 3-phase implementation:
- Phase 1 (this commit): Core framework in hudi-common
- Phase 2: Flink-specific implementation
- Phase 3: Spark/DeltaStreamer implementation
Key components added:
1. BasePreCommitValidator
- Abstract base class for all validators
- Supports metadata-based validation
- Engine-agnostic design
2. ValidationContext (interface)
- Provides access to commit metadata, timeline, write stats
- Engine-specific implementations provide concrete access
- Abstracts engine details from validation logic
3. StreamingOffsetValidator
- Base class for streaming offset validators
- Compares source offset differences with record counts
- Configurable tolerance and warn-only mode
- Designed for multiple checkpoint formats (Kafka, Flink, Pulsar, Kinesis);
  only the DeltaStreamer Kafka format is parsed in Phase 1
4. CheckpointUtils
- Multi-format checkpoint parsing utility
- Supports DeltaStreamer Kafka format (Phase 1)
- Extensible for Flink, Pulsar, Kinesis (future phases)
- Offset difference calculation with edge case handling
5. Comprehensive unit tests
- TestCheckpointUtils: 14 test cases
- TestStreamingOffsetValidator: 9 test cases
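
To illustrate the checkpoint parsing and offset difference calculation described for CheckpointUtils, here is a minimal sketch. It is not the code added by this PR. It assumes the DeltaStreamer Kafka checkpoint string has the form `topic,partition:offset,partition:offset,...`; the class name `KafkaCheckpointSketch`, its method names, and the edge-case rules (a new partition starts at 0, an offset reset counts from 0) are hypothetical choices made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch only; not the CheckpointUtils class from this PR. */
final class KafkaCheckpointSketch {

  /** Parses "topic,partition:offset,..." into a partition -> offset map. */
  static Map<Integer, Long> parseOffsets(String checkpoint) {
    if (checkpoint == null || checkpoint.trim().isEmpty()) {
      throw new IllegalArgumentException("Checkpoint is null or empty");
    }
    String[] parts = checkpoint.trim().split(",");
    if (parts.length < 2) {
      throw new IllegalArgumentException("No partition offsets in: " + checkpoint);
    }
    Map<Integer, Long> offsets = new LinkedHashMap<>();
    // parts[0] is the topic name; the remaining entries are partition:offset pairs.
    for (int i = 1; i < parts.length; i++) {
      String[] pair = parts[i].split(":");
      if (pair.length != 2) {
        throw new IllegalArgumentException("Malformed partition entry: " + parts[i]);
      }
      offsets.put(Integer.parseInt(pair[0].trim()), Long.parseLong(pair[1].trim()));
    }
    return offsets;
  }

  /** Sums the per-partition offset advance from the previous to the current checkpoint. */
  static long offsetDifference(String previous, String current) {
    Map<Integer, Long> prev = parseOffsets(previous);
    Map<Integer, Long> curr = parseOffsets(current);
    long total = 0L;
    for (Map.Entry<Integer, Long> entry : curr.entrySet()) {
      // Assumption: a partition missing from the previous checkpoint started at 0.
      long start = prev.getOrDefault(entry.getKey(), 0L);
      long delta = entry.getValue() - start;
      // Assumption: a negative delta means the offset was reset, so count from 0.
      total += delta >= 0 ? delta : entry.getValue();
    }
    return total;
  }
}
```

With these assumptions, going from `events,0:100,1:200` to `events,0:150,1:260,2:10` gives a difference of 50 + 60 + 10 = 120 consumed offsets.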
Configuration:
- hoodie.precommit.validators.streaming.offset.tolerance.percentage
  (default: 0.0)
- hoodie.precommit.validators.warn.only (default: false)
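
The following sketch shows how the two settings above could drive an offset validator built on a context abstraction. It is an assumption-laden illustration, not the PR's implementation: `SketchValidationContext`, `SketchPreCommitValidator`, `SketchOffsetValidator`, and all their method names are hypothetical stand-ins for ValidationContext, BasePreCommitValidator, and StreamingOffsetValidator. The deviation formula (absolute difference between offsets consumed and records written, as a percentage of offsets consumed) is also assumed. Only the two configuration keys and their defaults come from the description above.

```java
import java.util.Properties;

/** Hypothetical stand-in for ValidationContext; method names are assumptions. */
interface SketchValidationContext {
  long sourceOffsetDifference(); // offsets consumed since the previous commit
  long totalRecordsWritten();    // records reported by the commit's write stats
}

/** Hypothetical stand-in for BasePreCommitValidator. */
abstract class SketchPreCommitValidator {
  protected final Properties props;

  SketchPreCommitValidator(Properties props) {
    this.props = props;
  }

  /** Returns true when the commit passes; may throw to block the commit. */
  abstract boolean validate(SketchValidationContext context);
}

/** Hypothetical offset validator using the two configuration keys listed above. */
final class SketchOffsetValidator extends SketchPreCommitValidator {
  static final String TOLERANCE_KEY =
      "hoodie.precommit.validators.streaming.offset.tolerance.percentage";
  static final String WARN_ONLY_KEY = "hoodie.precommit.validators.warn.only";

  SketchOffsetValidator(Properties props) {
    super(props);
  }

  /** Assumed formula: |offsetDiff - recordsWritten| as a percentage of offsetDiff. */
  static double deviationPercentage(long offsetDiff, long recordsWritten) {
    if (offsetDiff < 0) {
      throw new IllegalArgumentException("offsetDiff must be non-negative");
    }
    if (offsetDiff == 0) {
      // Assumption: records written with no offsets consumed is a full deviation.
      return recordsWritten == 0 ? 0.0 : 100.0;
    }
    return Math.abs(offsetDiff - recordsWritten) * 100.0 / offsetDiff;
  }

  @Override
  boolean validate(SketchValidationContext context) {
    double tolerance = Double.parseDouble(props.getProperty(TOLERANCE_KEY, "0.0"));
    boolean warnOnly = Boolean.parseBoolean(props.getProperty(WARN_ONLY_KEY, "false"));
    double deviation = deviationPercentage(
        context.sourceOffsetDifference(), context.totalRecordsWritten());
    if (deviation <= tolerance) {
      return true;
    }
    String message = "Offset/record deviation " + deviation
        + "% exceeds tolerance " + tolerance + "%";
    if (warnOnly) {
      // Warn-only mode: report the mismatch but let the commit proceed.
      System.err.println("WARN: " + message);
      return false;
    }
    throw new IllegalStateException(message);
  }
}
```

Under these assumptions, 1000 consumed offsets against 990 written records is a 1.0% deviation: it passes with a tolerance of 2.0, throws with the default tolerance of 0.0, and only logs a warning when warn-only mode is enabled.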