[GitHub] [spark] HeartSaVioR commented on pull request #32747: [SPARK-35611][SS] Introduce the strategy on mismatched offset for start offset timestamp on Kafka data source

GitBox Wed, 02 Jun 2021 15:25:35 -0700


HeartSaVioR commented on pull request #32747:
URL: https://github.com/apache/spark/pull/32747#issuecomment-853422778



   > Let me know if I understand correctly or not. So it sounds like when start 
offset timestamp cannot be found on Kafka, Spark will turn to read latest 
offset instead. I think it makes sense there is an option for end users to 
avoid query failure in that case. Just wondering, is latest offset the best 
option? If the start offset timestamp is far before latest offset, does it 
still make sense to retrieve latest offset?
   > 
   > For example, partition 1 is able to get record with timestamp 1, but 
partition 2 returns unmatched offset. With this option, we retrieve latest 
offset from partition 2 instead. But the latest offset could be with timestamp 
1000, though there are some records after timestamp 1 before timestamp 1000?
   
   Yeah, that is quite tricky to understand. The fact is, the "offset by 
timestamp" functionality can never guarantee that further batches will contain 
NO records with a timestamp less than the specified one. IIUC Kafka has no 
ordering requirement on timestamps, and even its default behavior doesn't 
guarantee such ordering: by default the timestamp is assigned on the producer 
side, which suffers from the known clock issues. You can even set the timestamp 
in the record manually, which means the timestamp can carry "event time" 
semantics, and then you can't predict the ordering at all.
   
   That's why we had to explain in the doc that we simply call Kafka's offset 
by timestamp and use the result. So the problem exists even if a matched offset 
is found. The "offset by timestamp" functionality should be used with this 
understanding.
   
   So there's no best option that works for every case, as even the normal case 
is broken if you take the semantics of "offset by timestamp" to be that no 
record with a timestamp less than the specified one will be read. Either the 
data itself should respect the ordering, or end users should tolerate some 
missing records.
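
   To make this concrete, here is a minimal sketch in plain Python (not Spark 
or Kafka client code). It assumes the lookup returns the earliest offset whose 
timestamp is greater than or equal to the requested one, or nothing when no 
such record exists. The function name `offset_for_timestamp` and the sample 
timestamps are hypothetical, chosen only for illustration.

```python
from typing import List, Optional


def offset_for_timestamp(timestamps: List[int], target: int) -> Optional[int]:
    """Return the earliest offset whose timestamp is >= target, else None.

    `timestamps` models one partition: the list index is the offset and the
    value is that record's timestamp. Timestamps need not be sorted.
    """
    for offset, ts in enumerate(timestamps):
        if ts >= target:
            return offset
    return None


# Hypothetical partition whose record timestamps are out of order.
partition_1 = [5, 12, 8, 20, 9]
start = offset_for_timestamp(partition_1, 10)

# Records read from the matched offset onward that are still "too old".
older_than_target = [ts for ts in partition_1[start:] if ts < 10]

# Hypothetical partition with no record at or after the requested timestamp.
partition_2 = [1, 2, 3]
unmatched = offset_for_timestamp(partition_2, 10)

print(start, older_than_target, unmatched)
```

   Even in the matched case, reading from the returned offset still yields 
records with timestamps below the requested one, so the unmatched case is not 
the only place where the "no older records" expectation fails.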


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



