sivabalan narayanan created HUDI-6609:
-----------------------------------------

             Summary: Fix multi-writer with deltastreamer checkpointing
                 Key: HUDI-6609
                 URL: https://issues.apache.org/jira/browse/HUDI-6609
             Project: Apache Hudi
          Issue Type: Improvement
          Components: deltastreamer
            Reporter: sivabalan narayanan


As of now, we store checkpoints in commit metadata while writing via 
deltastreamer.
 
To support multiple writers (multiple deltastreamers), we added support some 
time back for storing the checkpoint as a map with one entry per writer, keyed 
by the writer identifier.
 
{ { "writer1" = "checkpointVal1"},
{ "writer2" = "checkpointVal2"} }
 
But this incurs some additional locking, since every time a checkpoint needs 
to be updated, we have to reload the timeline and fetch the latest known 
commit metadata.
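 
To make the locking cost concrete, here is a minimal sketch of the shared-map 
approach described above. It is not Hudi code: the in-memory `timeline` list, 
`table_lock`, and `commit_with_shared_map` are hypothetical stand-ins, used only 
to show why each writer has to re-read the latest commit metadata before it 
can write its own entry.

```python
import threading

# Hypothetical in-memory stand-in for the timeline: one dict per commit.
timeline = []
# Hypothetical table-level lock shared by all writers.
table_lock = threading.Lock()


def commit_with_shared_map(writer_id, new_checkpoint):
    """Write a commit whose metadata carries the full writer -> checkpoint map."""
    with table_lock:
        # The latest commit must be re-read under the lock; otherwise the
        # entries of the other writers would be dropped from the new map.
        latest = dict(timeline[-1]["checkpoints"]) if timeline else {}
        latest[writer_id] = new_checkpoint
        timeline.append({"writer": writer_id, "checkpoints": latest})


commit_with_shared_map("writer1", "checkpointVal1")
commit_with_shared_map("writer2", "checkpointVal2")
```

In this sketch the reload and the write both sit inside the lock, which is the 
extra locking the issue refers to.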
 
Instead, we can decouple the checkpoints.
Each writer only updates its own checkpoint. While parsing/fetching the 
latest known checkpoint for a writer, we might need to walk back in the 
timeline to find the right checkpoint value.
 
For example:
commit1 by writer1: commit metadata ➝ {writer1 = checkpointVal1}
commit2 by writer2: commit metadata ➝ {writer2 = checkpointValA}
 
commit3 by writer1: To fetch the latest checkpoint for writer1, we walk back 
the timeline and fetch the checkpoint of interest. Even though the latest 
commit metadata (commit2) has a checkpoint, its key refers to writer2, so we 
need to go back and fetch the checkpoint from commit1.
Finally, writer1 will update the commit metadata to {writer1 = 
checkpointVal2}.
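 
The walk-back described above can be sketched as follows. This is only an 
illustration under stated assumptions, not the deltastreamer implementation: 
the timeline is modelled as a plain list ordered oldest to newest, and 
`latest_checkpoint` and the `instant`/`checkpoints` field names are 
hypothetical.

```python
from typing import Optional


def latest_checkpoint(timeline, writer_id) -> Optional[str]:
    """Walk the timeline from newest to oldest and return the first
    checkpoint whose key matches writer_id, or None if there is none."""
    for commit in reversed(timeline):
        checkpoints = commit["checkpoints"]
        if writer_id in checkpoints:
            return checkpoints[writer_id]
    return None


# Hypothetical timeline, oldest commit first; each commit stores only the
# checkpoint of the writer that produced it.
timeline = [
    {"instant": "commit1", "checkpoints": {"writer1": "checkpointVal1"}},
    {"instant": "commit2", "checkpoints": {"writer2": "checkpointValA"}},
]

# writer1 skips commit2 (keyed by writer2) and resumes from commit1.
resume_from = latest_checkpoint(timeline, "writer1")

# writer1's next commit then records only its own new checkpoint.
timeline.append({"instant": "commit3", "checkpoints": {"writer1": "checkpointVal2"}})
```

With this layout a writer never needs to copy other writers' entries, at the 
cost of a possibly longer scan back through the timeline.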
 
 
Also, please check when the multiple-checkpoint support was added. If it was 
added before 0.13.0, we need to ensure this change is backwards compatible as 
well. If not, we are good; just fixing the existing solution would suffice.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
