hudi-bot opened a new issue, #16136:
URL: https://github.com/apache/hudi/issues/16136
Currently, we store checkpoints in the commit metadata when writing via
deltastreamer.
To support multiple writers (multiple deltastreamers), we added support some
time back for the checkpoint to be a map with multiple entries, where each key
is a writer identifier.
{ { "writer1" = "checkpointVal1" },
{ "writer2" = "checkpointVal2" } }
But this incurs some additional locking: every time a checkpoint needs to be
updated, we have to reload the timeline and fetch the latest known commit
metadata.
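A minimal sketch of why the merged map needs a lock, assuming the timeline can be modeled as a plain Python list of commit metadata dicts. The names `commit_with_merged_checkpoints` and the `"checkpoints"` key are hypothetical and are not Hudi APIs; the point is only that each commit must first read the latest map before writing its own.

```python
import threading

def commit_with_merged_checkpoints(timeline, writer_id, checkpoint, lock):
    """Append a commit whose metadata carries the checkpoints of ALL writers.

    Hypothetical model: `timeline` is a list of dicts, oldest first.
    """
    with lock:
        # The read of the latest map and the append must be atomic,
        # otherwise a concurrent writer's entry could be lost.
        merged = dict(timeline[-1]["checkpoints"]) if timeline else {}
        merged[writer_id] = checkpoint
        timeline.append({"checkpoints": merged})

timeline = []
lock = threading.Lock()
commit_with_merged_checkpoints(timeline, "writer1", "checkpointVal1", lock)
commit_with_merged_checkpoints(timeline, "writer2", "checkpointVal2", lock)
```

After these two calls the latest commit holds both entries, which is what makes lookup a single read but every write a read-modify-write.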
Instead, we can decouple the checkpoints.
Each writer updates only its own checkpoint, and when fetching the latest known
checkpoint for a writer, we might need to walk back in the timeline to find the
most recent commit that carries that writer's checkpoint value.
For example:
commit1 by writer1: commit metadata ➝ {writer1 = checkpointVal1}
commit2 by writer2: commit metadata ➝ {writer2 = checkpointValA}
commit3 by writer1: To fetch the latest checkpoint for writer1, we walk back the
timeline and fetch the checkpoint of interest. Even though the latest commit
metadata (commit2) has a checkpoint, its key refers to writer2, so we need to
go back and fetch the checkpoint from commit1.
Finally, writer1 will update the commit metadata to {writer1 =
checkpointVal2}.
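The walk-back lookup described above could be sketched roughly as follows. This is an illustration under assumptions, not Hudi code: the timeline is modeled as a list of dicts ordered oldest first, and `latest_checkpoint` and the `"checkpoints"` key are hypothetical names.

```python
from typing import Optional

def latest_checkpoint(timeline, writer_id) -> Optional[str]:
    """Return the most recent checkpoint written by `writer_id`, or None.

    Walks the (hypothetical) timeline from newest to oldest and skips
    commits whose metadata only holds other writers' checkpoints.
    """
    for commit in reversed(timeline):
        checkpoints = commit.get("checkpoints", {})
        if writer_id in checkpoints:
            return checkpoints[writer_id]
    return None

timeline = [
    {"checkpoints": {"writer1": "checkpointVal1"}},  # commit1 by writer1
    {"checkpoints": {"writer2": "checkpointValA"}},  # commit2 by writer2
]

# writer1 resumes: the latest commit belongs to writer2, so the lookup
# falls through to commit1.
resumed_from = latest_checkpoint(timeline, "writer1")

# writer1's next commit records only its own checkpoint.
timeline.append({"checkpoints": {"writer1": "checkpointVal2"}})
```

The trade-off is that writes no longer need to read other writers' state, while reads may scan several commits; in the worst case the scan reaches the start of the timeline and returns `None`.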
By the way, please check when the multiple-checkpoint support was added. If it
was added before 0.13.0, we need to ensure this change is backwards compatible
as well. If not, we are good, and just fixing the existing solution would
suffice.
## JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-6609
- Type: Improvement