davehagman opened a new pull request #3820:
URL: https://github.com/apache/hudi/pull/3820
## What is the purpose of the pull request
In order to support multi-writer concurrency where one writer is the
Deltastreamer, other writers must copy any checkpoint state from previous
commits into their current one in order to prevent interleaved commits from
crashing the deltastreamer.
Before this change, the code responsible for this worked as follows:
* Get all the keys from the *current* inflight commit's metadata
* Keep only the keys that are specified in the metadata config (a list
of keys to replace)
* For each remaining key, read its value from the *previous* commit and
overwrite the current commit's metadata property with that value
This does not work because a non-deltastreamer writer (such as a Spark
datasource writer) never has the checkpoint key
(`deltastreamer.checkpoint.key`) in its own commit metadata, so the key is
never selected for copying. The result is a commit in the timeline that has
no checkpoint state. If the deltastreamer tries to start from that commit,
it fails.
This PR fixes that by taking the key set to filter from the previous commit
instead of the current commit. This addresses two issues:
1. Checkpoint state is copied over from a previous commit which was made by
the deltastreamer
2. If the deltastreamer process fails or is stopped for a prolonged period
of time, the non-deltastreamer writers will continue to carry over the
checkpoint state which will allow the deltastreamer to correctly start from its
last known position
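
The difference between the two behaviors can be illustrated with a minimal sketch that models commit metadata as plain maps. This is not the actual Hudi implementation: the class name, method names, and the sample checkpoint value are hypothetical and chosen only for illustration. Only the key `deltastreamer.checkpoint.key` comes from the description above.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not Hudi's TransactionUtils.
public class CheckpointCarryOver {

    // Old behavior: candidate keys are taken from the CURRENT commit's metadata.
    public static Map<String, String> overrideUsingCurrentKeys(
            Map<String, String> previous, Map<String, String> current, List<String> keysToCarry) {
        Map<String, String> result = new HashMap<>(current);
        for (String key : current.keySet()) {
            if (keysToCarry.contains(key)) {
                result.put(key, previous.get(key));
            }
        }
        return result;
    }

    // New behavior: candidate keys are taken from the PREVIOUS commit's metadata.
    public static Map<String, String> overrideUsingPreviousKeys(
            Map<String, String> previous, Map<String, String> current, List<String> keysToCarry) {
        Map<String, String> result = new HashMap<>(current);
        for (String key : previous.keySet()) {
            if (keysToCarry.contains(key)) {
                result.put(key, previous.get(key));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String checkpointKey = "deltastreamer.checkpoint.key";
        // Previous commit was written by the deltastreamer (made-up checkpoint value).
        Map<String, String> previous = Map.of(checkpointKey, "example-checkpoint");
        // Current commit comes from a writer that knows nothing about checkpoints.
        Map<String, String> current = Map.of("some.other.key", "value");
        List<String> keysToCarry = List.of(checkpointKey);

        // Old behavior: the key is absent from the current commit, so nothing is copied.
        System.out.println(
            overrideUsingCurrentKeys(previous, current, keysToCarry).containsKey(checkpointKey)); // false
        // New behavior: the key is found in the previous commit and carried over.
        System.out.println(
            overrideUsingPreviousKeys(previous, current, keysToCarry).containsKey(checkpointKey)); // true
    }
}
```

In this sketch the only change between the two methods is which map supplies the candidate keys, which mirrors the one-line nature of the fix described above.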
## Brief change log
- Modify `TransactionUtils::overrideWithLatestCommitMetadata` to pull the
keys from the last commit instead of the current commit
## Verify this pull request
* Manually verified the change by running multiple writers against the same
table
  * Writer One: Deltastreamer, Kafka source
  * Writer Two: Spark datasource, event data from an existing Hudi table
* Verified zero errors from the deltastreamer over hundreds of interleaved
commits
* Shut down the deltastreamer for a prolonged period, then verified that I
could start it back up without losing its position in Kafka (checkpoint state
intact on recent commits)
## Committer checklist
- [x] Has a corresponding JIRA in PR title & commit
- [x] Commit message is descriptive of the change
- [ ] CI is green
- [x] Necessary doc changes done or have another open PR