shangxinli opened a new pull request, #18124:
URL: https://github.com/apache/hudi/pull/18124

   This commit adds utility functions to StreamerUtil for extracting and 
parsing Kafka offset metadata from Hudi commits, enabling analysis of message 
processing progress between commits.
   
   Changes:
   - Add extractKafkaOffsetMetadata() to extract Kafka offset metadata from 
commit
   - Add parseKafkaOffsets() to parse URL-encoded Kafka offset strings into 
partition->offset map
   - Add calculateKafkaOffsetDifference() to compute total offset delta between 
two commits
   - Add comprehensive unit tests in TestStreamerUtil covering normal cases and 
edge cases
   - Add constants: HOODIE_METADATA_KEY, URL_ENCODED_COLON, PARTITION_SEPARATOR
   
   Use Cases:
   - Data loss detection by comparing expected vs actual record counts
   - Monitoring Kafka consumption progress across Hudi commits
   - Debugging offset tracking issues in Flink-Kafka-Hudi pipelines
   
   Test Coverage:
   - Metadata extraction from commits
   - Parsing various Kafka offset string formats
   - Handling new partitions, malformed entries, missing metadata
   - Calculating offset differences with edge cases
   
   Original Internal Commit (for lineage):
   bb19ca6cbe - Add helper function to parse the Kafka offsets diff range 
between two commits Differential: https://code.uberinternal.com/D19046249 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to