pengzhiwei2018 edited a comment on pull request #2485: URL: https://github.com/apache/hudi/pull/2485#issuecomment-772986092
> @pengzhiwei2018 I am planning to spend some time on this as well.
>
> High level question: does the `offset` for the streaming read map to `_hoodie_commit_seq_no` in this implementation? This way we can actually do record level streams and even resume where we left off.

Hi @vinothchandar, you are welcome to join this. Currently the `HoodieSourceOffset` just keeps the `commitTime`, and every micro-batch consumes the incremental data between `(lastCommitTime, currentCommitTime]`. If a failure happens mid-consumption, the query recovers from the offset state and re-consumes the data between `(lastCommitTime, currentCommitTime]`, so recovery is at the commit level.

Introducing `_hoodie_commit_seq_no` into the `offset` could make recovery more fine-grained, down to the record level. The problem is how to determine the max `commit_seq_no` within a commit: in the `getOffset` method we must tell Spark which `commit_seq_no` the micro-batch will read up to, but currently the Hudi metadata only records the commit time for each commit. So slicing the offset at record granularity is a problem.
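To make the commit-level semantics above concrete, here is a minimal, hypothetical sketch (not Hudi's actual `HoodieSourceOffset`) of an offset that holds only a commit time, together with the half-open membership check for the batch interval `(lastCommitTime, currentCommitTime]`. All names here are illustrative assumptions:

```java
// Hypothetical sketch of a commit-time-based streaming offset.
// Hudi's real HoodieSourceOffset differs; this only illustrates the
// commit-level granularity described in the comment above.
public class CommitTimeOffset implements Comparable<CommitTimeOffset> {
    // Hudi commit times are lexicographically ordered timestamp strings,
    // e.g. "20210203120000".
    private final String commitTime;

    public CommitTimeOffset(String commitTime) {
        this.commitTime = commitTime;
    }

    public String getCommitTime() {
        return commitTime;
    }

    @Override
    public int compareTo(CommitTimeOffset other) {
        return commitTime.compareTo(other.commitTime);
    }

    // A commit with time t belongs to the micro-batch (last, current]
    // exactly when last < t <= current. Because recovery replays the whole
    // interval, the finest unit of resumption is one commit.
    public static boolean inBatch(String t,
                                  CommitTimeOffset last,
                                  CommitTimeOffset current) {
        return last.getCommitTime().compareTo(t) < 0
            && t.compareTo(current.getCommitTime()) <= 0;
    }
}
```

A record-level offset would additionally need a `commit_seq_no` field, which is exactly what is hard to bound in `getOffset` when the metadata stores only commit times.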
