singhpk234 opened a new pull request, #4479:
URL: https://github.com/apache/iceberg/pull/4479

   At present, the MicroBatch stream determines the latest offset by finding 
the latest snapshot and treating all changes up to that snapshot as part of 
the micro-batch. This can lead to uneven stream sizes.
   
https://github.com/apache/iceberg/blob/a6993dfc265b7be32ca5cc9774d3648de5cdaa6d/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java#L113-L114
   
   
   To handle this, Spark defines an interface `SupportsAdmissionControl` which 
lets us rate-limit based on the number of files and the record count. If we 
implement this interface, our latest offset will be derived from:
   
   ```java
   public Offset latestOffset(Offset startOffset, ReadLimit limit) {
     // ...
   }
   ```
   
   
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L419-L533
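
   The rate-limiting idea can be sketched in plain Java: given the pending 
snapshots since the start offset (each contributing some number of new files), 
admit snapshots into the next micro-batch only while a per-trigger file limit 
holds, instead of always advancing to the latest snapshot. The class and 
method names below are hypothetical illustration, not the actual 
`SparkMicroBatchStream` implementation:

   ```java
   import java.util.Arrays;
   import java.util.List;

   // Hypothetical sketch of admission control for micro-batch planning.
   public class AdmissionControlSketch {

       // Returns the exclusive end index of the pending snapshots admitted
       // into the next micro-batch under a max-files-per-trigger limit.
       static int endIndexForLimit(List<Integer> filesPerSnapshot, int maxFilesPerTrigger) {
           int admitted = 0;
           int end = 0;
           for (int files : filesPerSnapshot) {
               // Stop once the limit would be exceeded, but always admit at
               // least one snapshot so the stream can make progress.
               if (admitted + files > maxFilesPerTrigger && admitted > 0) {
                   break; // remaining snapshots wait for the next trigger
               }
               admitted += files;
               end++;
           }
           return end;
       }

       public static void main(String[] args) {
           // Three pending snapshots with 4, 3, and 5 new files each.
           List<Integer> pending = Arrays.asList(4, 3, 5);
           // With a limit of 8 files, only the first two snapshots fit.
           System.out.println(endIndexForLimit(pending, 8));   // prints 2
           // With an effectively unbounded limit, all snapshots are taken,
           // which matches the current "latest snapshot" behavior.
           System.out.println(endIndexForLimit(pending, 100)); // prints 3
       }
   }
   ```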
   
   ---
   Testing 
   Tested against `TestStructuredStreamingRead3`; all tests pass.
   
   TODO:
   - refactor to increase reusability 
   - add more unit tests 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
