pengzhiwei2018 edited a comment on pull request #2485: URL: https://github.com/apache/hudi/pull/2485#issuecomment-772986092
> @pengzhiwei2018 I am planning to spend some time on this as well.
>
> High level question: does the `offset` for the streaming read map to `_hoodie_commit_seq_no` in this implementation? This way we can actually do record level streams and even resume where we left off.

Hi @vinothchandar, you are welcome to join this. Currently the `HoodieSourceOffset` just keeps the `commitTime`, and every micro-batch consumes the incremental data between `(lastCommitTime, currentCommitTime]`. If a failure happens mid-consumption, the query recovers from the offset state and re-consumes the data between `(lastCommitTime, currentCommitTime]`, so recovery is at the commit level.

Introducing `_hoodie_commit_seq_no` into the `offset` could make recovery more fine-grained, down to the record level. The problem is how to determine the max `commit_seq_no` within a commit: in the `getOffset` method we must tell Spark which `commit_seq_no` the micro-batch will read up to, but currently the Hudi metadata only records the commit time for each commit. So slicing the offset at record granularity is a problem.
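To make the commit-level semantics above concrete, here is a minimal, hypothetical sketch (not Hudi's actual `HoodieSourceOffset`) of an offset that holds only a commit time, together with the half-open membership check for the batch interval `(lastCommitTime, currentCommitTime]`. All names here are illustrative assumptions:

```java
// Hypothetical sketch of a commit-time-based streaming offset.
// Hudi's real HoodieSourceOffset differs; this only illustrates the
// commit-level granularity described in the comment above.
public class CommitTimeOffset implements Comparable<CommitTimeOffset> {
    // Hudi commit times are lexicographically ordered timestamp strings,
    // e.g. "20210203120000".
    private final String commitTime;

    public CommitTimeOffset(String commitTime) {
        this.commitTime = commitTime;
    }

    public String getCommitTime() {
        return commitTime;
    }

    @Override
    public int compareTo(CommitTimeOffset other) {
        return commitTime.compareTo(other.commitTime);
    }

    // A commit with time t belongs to the micro-batch (last, current]
    // exactly when last < t <= current. Because recovery replays the whole
    // interval, the finest unit of resumption is one commit.
    public static boolean inBatch(String t,
                                  CommitTimeOffset last,
                                  CommitTimeOffset current) {
        return last.getCommitTime().compareTo(t) < 0
            && t.compareTo(current.getCommitTime()) <= 0;
    }
}
```

A record-level offset would additionally need a `commit_seq_no` field, which is exactly what is hard to bound in `getOffset` when the metadata stores only commit times.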
