Re: [PR] [SPARK-54027] Kafka Source RTM support [spark]

via GitHub Fri, 31 Oct 2025 12:26:00 -0700


jerrypeng commented on code in PR #52729:
URL: https://github.com/apache/spark/pull/52729#discussion_r2482466318



##########
connector/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchStream.scala:
##########
@@ -235,7 +345,30 @@ private[kafka010] class KafkaMicroBatchStream(
   override def toString(): String = s"KafkaV2[$kafkaOffsetReader]"
 
   override def metrics(latestConsumedOffset: Optional[Offset]): ju.Map[String, String] = {
-    KafkaMicroBatchStream.metrics(latestConsumedOffset, latestPartitionOffsets)
+    var rtmFetchLatestOffsetsTimeMs = Option.empty[Long]
+    val reCalculatedLatestPartitionOffsets =
+      if (inRealTimeMode) {
+        if (!latestConsumedOffset.isPresent) {
+          // this means a batch has no end offsets, which should not happen
+          None
+        } else {
+          Some {
+            val startTime = System.currentTimeMillis()
+            val latestOffsets = kafkaOffsetReader.fetchLatestOffsets(
+              Some(latestConsumedOffset.get.asInstanceOf[KafkaSourceOffset].partitionToOffsets))
+            val endTime = System.currentTimeMillis()
+            rtmFetchLatestOffsetsTimeMs = Some(endTime - startTime)
+            latestOffsets
+          }
+        }
+      } else {
+        // If we are in micro-batch mode, we need to get the latest partition offsets at the
+        // start of the batch and recalculate the latest offsets at the end for backlog
+        // estimation.
+        Some(kafkaOffsetReader.fetchLatestOffsets(Some(latestPartitionOffsets)))

Review Comment:
   This actually fixes an issue with non-RTM queries that use Kafka. The
existing calculation is incorrect and always results in backlog metrics of
zero. "latestPartitionOffsets" is computed when "latestOffset" is called at
the beginning of a batch, so it is effectively the offset this batch will read
up to. For non-RTM streaming queries, latestConsumedOffset is therefore the
same as latestPartitionOffsets, which gives a zero backlog. Instead, we should
fetch the latest offsets from the source Kafka topic after the batch is
processed, i.e. when metrics() is called, to calculate a useful backlog
metric. I know this is not really related to RTM, so let me know if I should
create a separate PR for this.
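
   To make the zero-backlog point concrete, here is a minimal, self-contained
sketch that does not use Spark or Kafka. All names in it (BacklogSketch,
Offsets, backlog, plannedEnd, refetched) are hypothetical, and offsets are
modelled as a plain map from partition name to offset. It sums the
per-partition difference, which is only an illustration and not necessarily
how the real metrics are aggregated.

```scala
object BacklogSketch {
  // Hypothetical stand-in for a topic-partition -> offset map.
  type Offsets = Map[String, Long]

  // Illustrative backlog: per-partition (latest available - consumed),
  // floored at zero and summed across partitions.
  def backlog(latestAvailable: Offsets, consumed: Offsets): Long =
    latestAvailable.iterator.map { case (tp, latest) =>
      math.max(0L, latest - consumed.getOrElse(tp, 0L))
    }.sum

  // Offsets captured when the batch was planned; the batch reads up to these.
  val plannedEnd: Offsets = Map("topic-0" -> 100L, "topic-1" -> 250L)

  // After the batch completes, the consumed offsets equal the planned end.
  val consumed: Offsets = plannedEnd

  // Offsets fetched again at metrics time; producers kept writing meanwhile.
  val refetched: Offsets = Map("topic-0" -> 130L, "topic-1" -> 275L)

  // Comparing against the batch-start snapshot always yields zero.
  val staleBacklog: Long = backlog(plannedEnd, consumed)

  // Comparing against freshly fetched offsets reflects the real lag.
  val freshBacklog: Long = backlog(refetched, consumed)
}
```

   With the batch-start snapshot the two maps are identical, so every
per-partition difference is zero. Only the re-fetched offsets can show records
that arrived while the batch was running.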



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

