vrajat opened a new pull request, #12608:
URL: https://github.com/apache/pinot/pull/12608

   Pinot may go multiple hours between polls of a partition in a Kafka topic, 
for example when it takes a long time to flush a segment to disk. In the 
meantime, messages in Kafka can expire if the topic's message retention time 
is short.
   If `auto.offset.reset` is set to `smallest`, then Kafka will silently move 
the offset to the first available message, leading to data loss.
   RealtimeSegmentValidationManager is a periodic task that runs every hour 
and detects when the offset of a segment in ZooKeeper is behind the smallest 
offset in Kafka. However, since it runs only once an hour, it may miss data 
loss that happens between runs.
   
   This commit compares the startOffset to the batchFirstOffset. 
   * startOffset: Offset requested by the database for the next batch.
   * batchFirstOffset: First offset of the batch of messages received from the 
stream. 
   If startOffset < batchFirstOffset, the condition is logged and a meter is 
set to 1.
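
   The comparison above can be sketched roughly as follows. This is a minimal 
illustration, not the code in this PR: the class `OffsetGapDetector`, its 
method names, and the `AtomicLong` used as a stand-in for the meter are all 
hypothetical, and offsets are assumed to be plain `long` values as in Kafka.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper for illustration only; not a Pinot class.
final class OffsetGapDetector {
  // Stand-in for a metrics meter; a real system would use its metrics registry.
  private final AtomicLong offsetGapMeter = new AtomicLong(0L);

  // Number of offsets between what was requested and what the stream returned.
  // Zero means the batch starts at (or before) the requested offset.
  static long missingOffsets(long startOffset, long batchFirstOffset) {
    return Math.max(0L, batchFirstOffset - startOffset);
  }

  // Returns true when startOffset < batchFirstOffset, i.e. the stream skipped
  // ahead of the requested offset. In that case the condition is logged and
  // the meter is set to 1.
  boolean checkBatch(long startOffset, long batchFirstOffset) {
    long missing = missingOffsets(startOffset, batchFirstOffset);
    if (missing > 0) {
      System.err.printf(
          "Possible data loss: requested offset %d but batch starts at %d (%d offsets skipped)%n",
          startOffset, batchFirstOffset, missing);
      offsetGapMeter.set(1L);
      return true;
    }
    return false;
  }

  long meterValue() {
    return offsetGapMeter.get();
  }
}
```

   For example, under these assumptions `checkBatch(100, 250)` would report a 
gap of 150 offsets, while `checkBatch(100, 100)` would leave the meter 
untouched.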
   
   This check is implemented only for Kafka streams.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

