singhpk234 opened a new pull request, #4479: URL: https://github.com/apache/iceberg/pull/4479
At present, the MicroBatch stream determines the latest offset by finding the latest snapshot and treating all changes up to that snapshot as part of the micro-batch, which can lead to uneven stream sizes. https://github.com/apache/iceberg/blob/a6993dfc265b7be32ca5cc9774d3648de5cdaa6d/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java#L113-L114

To handle this, Spark defines an interface `SupportsAdmissionControl` that lets us rate-limit based on the number of files and the record count. If we implement this interface, our latest offset will be derived from:

```java
public Offset latestOffset(Offset startOffset, ReadLimit limit) {
  .....
}
```

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L419-L533

---

Testing

Tested against `TestStructuredStreamingRead3`; the tests pass.

TODO:
- Refactor to increase reusability
- Add more unit tests
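To illustrate the admission-control idea, here is a simplified, self-contained sketch of the kind of logic `latestOffset(startOffset, limit)` could apply under a `ReadLimit.maxFiles(n)`-style limit. This is not Iceberg's actual implementation and the names are hypothetical: it just walks snapshots in commit order and stops before the one that would push the cumulative data-file count past the limit, always admitting at least one snapshot so the stream makes progress.

```java
// Hypothetical sketch of file-count-based admission control for a micro-batch.
// Given the number of data files each pending snapshot adds, return the index
// of the last snapshot to include so the batch stays within maxFiles.
public class AdmissionControlSketch {
    public static int chooseEndSnapshot(long[] filesPerSnapshot, long maxFiles) {
        long admitted = 0;
        int end = -1; // index of the last admitted snapshot; -1 means none yet
        for (int i = 0; i < filesPerSnapshot.length; i++) {
            // Stop before a snapshot that would exceed the limit, but always
            // admit the first snapshot even if it alone exceeds maxFiles,
            // otherwise the stream could stall forever.
            if (end >= 0 && admitted + filesPerSnapshot[i] > maxFiles) {
                break;
            }
            admitted += filesPerSnapshot[i];
            end = i;
        }
        return end;
    }

    public static void main(String[] args) {
        // Snapshots add 3, 4, and 5 files; a 7-file limit admits the first two.
        System.out.println(chooseEndSnapshot(new long[] {3, 4, 5}, 7)); // prints 1
    }
}
```

In the real interface, a `SupportsAdmissionControl` implementation would map the chosen snapshot back to an `Offset`, and `getDefaultReadLimit()` would supply the limit when the user configures one.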
