[ 
https://issues.apache.org/jira/browse/HUDI-6609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6609:
---------------------------------
    Labels: pull-request-available  (was: )

> Fix multi-writer with deltastreamer checkpointing
> -------------------------------------------------
>
>                 Key: HUDI-6609
>                 URL: https://issues.apache.org/jira/browse/HUDI-6609
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: deltastreamer
>            Reporter: sivabalan narayanan
>            Priority: Major
>              Labels: pull-request-available
>
> As of now, we store checkpoints in commit metadata while writing via 
> deltastreamer.
>  
> To support multiple writers (multiple deltastreamers), we added support 
> some time back in which the checkpoint is a map that stores multiple entries, 
> with each key referring to a writer identifier.
>  
> { { "writer1" = "checkpointVal1" },
>   { "writer2" = "checkpointVal2" } }
>  
> But this incurs some additional locking, since every time a checkpoint 
> needs to be updated, we have to reload the timeline and fetch the latest 
> known commit metadata.
>  
> Instead, we can decouple the checkpoints.
> Each writer updates only its own checkpoint, and while fetching the 
> latest known checkpoint for a writer, we might need to walk back in the 
> timeline to find the right checkpoint value.
>  
> For example:
> commit1 by writer1: commit metadata ➝ {writer1 = checkpointVal1}
> commit2 by writer2: commit metadata ➝ {writer2 = checkpointValA}
>  
> commit3 by writer1: To fetch the latest checkpoint for writer1, we walk back the 
> timeline and fetch the checkpoint of interest. Even though the latest commit 
> metadata has a checkpoint, its key refers to writer2, so we 
> need to go back and fetch the checkpoint from commit1.
> Finally, writer1 will update the commit metadata to {writer1 = 
> checkpointVal2}
>  
>  
> By the way, please check when the multiple-checkpoint support was added. If it 
> was added before 0.13.0, we need to ensure this is backwards compatible as well. 
> If not, we are good; just fixing the existing solution would suffice.
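
The walk-back lookup proposed in the description can be sketched roughly as follows. This is a minimal illustration in plain Python, not Hudi code: the timeline is modelled as a list of dicts ordered oldest to newest, and the names `latest_checkpoint`, `commit_with_checkpoint`, and the `"checkpoints"` field are hypothetical stand-ins chosen for this sketch.

```python
from typing import Dict, List, Optional

# Hypothetical stand-in for a timeline: commits ordered oldest to newest.
# Each commit's metadata holds only the checkpoint(s) written with that commit.
Commit = Dict[str, object]


def latest_checkpoint(timeline: List[Commit], writer_id: str) -> Optional[str]:
    """Walk the timeline from newest to oldest and return the first
    checkpoint whose key matches writer_id, or None if there is none."""
    for commit in reversed(timeline):
        checkpoints = commit.get("checkpoints", {})
        if writer_id in checkpoints:
            return checkpoints[writer_id]
    return None


def commit_with_checkpoint(timeline: List[Commit], instant: str,
                           writer_id: str, checkpoint: str) -> None:
    """Append a commit that records only the committing writer's checkpoint."""
    timeline.append({"instant": instant, "checkpoints": {writer_id: checkpoint}})


timeline: List[Commit] = []
commit_with_checkpoint(timeline, "commit1", "writer1", "checkpointVal1")
commit_with_checkpoint(timeline, "commit2", "writer2", "checkpointValA")

# writer1 resumes: the newest commit belongs to writer2, so the lookup
# skips it and falls back to commit1.
resume_from = latest_checkpoint(timeline, "writer1")

# writer1 then records its new checkpoint without touching writer2's entry.
commit_with_checkpoint(timeline, "commit3", "writer1", "checkpointVal2")
```

Because the lookup only checks whether the writer's key is present in each commit's map, it would also read older commits that stored several writers in one map, which may help with the backwards-compatibility concern. Whether that holds for the real commit metadata format is an assumption to verify.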



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
