[
https://issues.apache.org/jira/browse/HUDI-6609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HUDI-6609:
---------------------------------
Labels: pull-request-available (was: )
> Fix multi-writer with deltastreamer checkpointing
> -------------------------------------------------
>
> Key: HUDI-6609
> URL: https://issues.apache.org/jira/browse/HUDI-6609
> Project: Apache Hudi
> Issue Type: Improvement
> Components: deltastreamer
> Reporter: sivabalan narayanan
> Priority: Major
> Labels: pull-request-available
>
> As of now, we store checkpoints in commit metadata while writing via
> deltastreamer.
>
> To support multiple writers (multiple deltastreamers), we added support
> some time back in which the checkpoint is a map that stores multiple entries,
> with each key referring to a writer identifier.
>
> { { "writer1" = "checkpointVal1"},
> { "writer2" = "checkpointVal2"} }
>
> But this incurs some additional locking, since every time a new checkpoint
> needs to be updated, we have to reload the timeline and fetch the latest
> known commit metadata.
>
> Instead, we can decouple the checkpoints.
> Each writer updates only its own checkpoint, and while parsing/fetching the
> latest known checkpoint for a writer, we might need to walk back in the
> timeline to find the right checkpoint value.
>
> For example:
> commit1 by writer1: commit metadata ➝ {writer1 = checkpointVal1}
> commit2 by writer2: commit metadata ➝ {writer2 = checkpointValA}
>
> commit3 by writer1: To fetch the latest checkpoint for writer1, we walk back
> the timeline and fetch the checkpoint of interest. Even though the latest
> commit metadata has a checkpoint, its key refers to writer2, so we need to
> go back and fetch the checkpoint from commit1.
> Finally, writer1 will update the commit metadata to {writer1 =
> checkpointVal2}
>
>
> By the way, please check when the multiple-checkpoint support was added. If
> it was added before 0.13.0, we need to ensure this is backwards compatible as
> well. If not, we are good; just fixing the existing solution would suffice.
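
The walk-back lookup described in the issue could be sketched roughly as follows. This is only an illustration, not Hudi code: the timeline is modelled as a plain list of per-commit checkpoint maps ordered oldest to newest, and the names `latest_checkpoint` and `timeline` are hypothetical. Because the lookup is by writer key, the same walk would presumably also find a value in an older commit whose map holds entries for several writers, though that is an assumption and not something the issue confirms.

```python
from typing import Dict, List, Optional

# Hypothetical model: each commit's metadata is reduced to a
# {writer_id: checkpoint} map. Real Hudi commit metadata is richer.
CommitCheckpoints = Dict[str, str]


def latest_checkpoint(timeline: List[CommitCheckpoints],
                      writer_id: str) -> Optional[str]:
    """Walk the timeline from newest to oldest and return the first
    checkpoint stored under writer_id, or None if there is none."""
    for checkpoints in reversed(timeline):
        if writer_id in checkpoints:
            return checkpoints[writer_id]
    return None


# Timeline from the example in the issue description.
timeline = [
    {"writer1": "checkpointVal1"},  # commit1 by writer1
    {"writer2": "checkpointValA"},  # commit2 by writer2
]

# writer1 skips commit2 (its key is writer2) and resumes from commit1.
resume_from = latest_checkpoint(timeline, "writer1")

# writer1 then records only its own entry in its next commit.
timeline.append({"writer1": "checkpointVal2"})
```

In this sketch each writer only appends its own entry, so updating a checkpoint does not require reading another writer's latest commit metadata first. The cost moves to the read side, where the walk may have to visit several commits.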
--
This message was sent by Atlassian Jira
(v8.20.10#820010)