sivabalan narayanan created HUDI-6609:
-----------------------------------------
Summary: Fix multi-writer with deltastreamer checkpointing
Key: HUDI-6609
URL: https://issues.apache.org/jira/browse/HUDI-6609
Project: Apache Hudi
Issue Type: Improvement
Components: deltastreamer
Reporter: sivabalan narayanan
As of now, we store checkpoints in commit metadata while writing via
deltastreamer.
To support multiple writers (multiple deltastreamers), we added support
some time back in which the checkpoint is a map with one entry per writer,
keyed by the writer identifier.
{ { "writer1" = "checkpointVal1"},
{ "writer2" = "checkpointVal2"} }
But this incurs some additional locking, since every time a checkpoint
needs to be updated, we have to reload the timeline and fetch the latest known
commit metadata.
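A minimal sketch of why the shared map forces a reload, not the actual deltastreamer code. The timeline is modeled as a plain list of per-commit checkpoint maps, and commit_with_shared_map is a hypothetical helper name:

```python
from typing import Dict, List

# Assumption: the timeline is modeled as a list of checkpoint maps,
# one per commit, oldest first. Real commit metadata carries much more.
Timeline = List[Dict[str, str]]


def commit_with_shared_map(timeline: Timeline, writer_id: str,
                           new_checkpoint: str) -> Dict[str, str]:
    """Write a commit whose metadata carries every writer's checkpoint."""
    # The writer must first read the latest commit metadata so that it can
    # carry forward the other writers' entries. This read-modify-write step
    # is what needs the timeline reload and the extra locking.
    merged = dict(timeline[-1]) if timeline else {}
    merged[writer_id] = new_checkpoint
    timeline.append(merged)
    return merged


timeline: Timeline = []
commit_with_shared_map(timeline, "writer1", "checkpointVal1")
commit_with_shared_map(timeline, "writer2", "checkpointVal2")
```

If two writers ran this read-modify-write concurrently without a lock, one of them could carry forward a stale entry for the other.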
Instead, we can decouple the checkpoints.
Each writer updates only its own checkpoint. While fetching the
latest known checkpoint for a writer, we might need to walk back in the
timeline to find the right checkpoint value.
For example:
commit1 by writer1: commit metadata ➝ {writer1 = checkpointVal1}
commit2 by writer2: commit metadata ➝ {writer2 = checkpointValA}
commit3 by writer1: To fetch the latest checkpoint for writer1, we walk back the
timeline and fetch the checkpoint of interest. Even though the latest commit
metadata (commit2) has a checkpoint, its key refers to writer2, so we need
to go back and fetch the checkpoint from commit1.
Finally, writer1 writes its commit metadata as {writer1 =
checkpointVal2}
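The walk-back in the example above can be sketched as follows. This is only an illustration under simplifying assumptions: the timeline is a plain list of per-commit checkpoint maps (oldest first), and latest_checkpoint_for is a hypothetical name, not an existing Hudi API.

```python
from typing import Dict, List, Optional


def latest_checkpoint_for(timeline: List[Dict[str, str]],
                          writer_id: str) -> Optional[str]:
    """Walk the timeline from newest to oldest and return the first
    checkpoint written by writer_id, or None if it never committed."""
    for commit_metadata in reversed(timeline):
        if writer_id in commit_metadata:
            return commit_metadata[writer_id]
    return None


# Each commit stores only the checkpoint of the writer that made it.
timeline = [
    {"writer1": "checkpointVal1"},  # commit1 by writer1
    {"writer2": "checkpointValA"},  # commit2 by writer2
]

# writer1 skips commit2 (it belongs to writer2) and resumes from commit1.
resume_from = latest_checkpoint_for(timeline, "writer1")

# writer1 then records only its own new checkpoint in the next commit.
timeline.append({"writer1": "checkpointVal2"})
```

With this layout a writer never rewrites another writer's entry, at the cost of a longer scan when a writer has not committed recently.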
By the way, please check when the multiple-checkpoint support was added. If it was
added before 0.13.0, we need to ensure this is backwards compatible as well. If
not, we are good; just fixing the existing solution would suffice.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)