chia7712 commented on code in PR #16602:
URL: https://github.com/apache/kafka/pull/16602#discussion_r1759757001
##########
core/src/main/scala/kafka/log/UnifiedLog.scala:
##########
@@ -1336,34 +1336,39 @@ class UnifiedLog(@volatile var logStartOffset: Long,
     val maxTimestampSoFar = latestTimestampSegment.readMaxTimestampAndOffsetSoFar
     // lookup the position of batch to avoid extra I/O
     val position = latestTimestampSegment.offsetIndex.lookup(maxTimestampSoFar.offset)
-    latestTimestampSegment.log.batchesFrom(position.position).asScala
+    val timestampAndOffsetOpt = latestTimestampSegment.log.batchesFrom(position.position).asScala
       .find(_.maxTimestamp() == maxTimestampSoFar.timestamp)
       .flatMap(batch => batch.offsetOfMaxTimestamp().asScala.map(new TimestampAndOffset(batch.maxTimestamp(), _,
         Optional.of[Integer](batch.partitionLeaderEpoch()).filter(_ >= 0))))
+    OffsetResultHolder(timestampAndOffsetOpt)
Review Comment:
> Even if we pass the shallowOffsetOfMaxTimestampSoFar in the
RemoteLogSegmentMetadata event, we have to download the remote log segment to
find the real offsetOfMaxTimestampSoFar.

IMHO, we don't need to pass `shallowOffsetOfMaxTimestampSoFar` to
`RemoteLogSegmentMetadata`, since `RemoteLogSegmentMetadata` already carries
`maxTimestampMs`. The basic behavior would be as follows (the same as you
described, I'd say):
1. find the max timestamp from the local segments
2. call `findOffsetByTimestamp` with the max timestamp from step 1
3. compare the timestamps of the remote and local records to pick the correct
offset
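
The three steps above might be sketched roughly like this. Note that `TsOffset` and `pick` are illustrative placeholders, not actual Kafka internals; the point is only the step-3 comparison:

```java
// Hedged sketch of the local-vs-remote max-timestamp comparison (step 3);
// TsOffset and pick are placeholders, not actual Kafka classes.
public class MaxTimestampSketch {
    record TsOffset(long timestamp, long offset) {}

    // Prefer the remote candidate only when its max timestamp is strictly
    // larger than the local one; otherwise the local offset stands.
    static TsOffset pick(TsOffset localMax, TsOffset remoteMax) {
        if (remoteMax != null && remoteMax.timestamp() > localMax.timestamp()) {
            return remoteMax;
        }
        return localMax;
    }

    public static void main(String[] args) {
        TsOffset local = new TsOffset(100L, 42L);  // step 1: local max
        TsOffset remote = new TsOffset(250L, 7L);  // step 2: remote lookup result
        System.out.println(pick(local, remote));   // remote wins: larger timestamp
        System.out.println(pick(local, null));     // no remote data: local wins
    }
}
```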
> This will increase the load on remote storage, assuming that the original
intention of KIP-734 is confirming topic/partition "liveness", which means
the Admin client will repeatedly invoke the list-offsets API for
MAX_TIMESTAMP.

The implementation of KIP-734 was wrong because we don't loop over all
records in all paths (due to the cost). Hence, we renamed
`offsetOfMaxTimestampSoFar` to `shallowOffsetOfMaxTimestampSoFar` to reflect
the actual behavior.
> Should we handle/drop the MAX_TIMESTAMP case for topics enabled with
remote storage? This can cause high load:

That is an acceptable approach. We could REJECT MAX_TIMESTAMP requests for
now, as it is a rare operation. Alternatively, we could make the call an
async operation too, since it needs to iterate over the metadata of all
remote segments.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]