davehagman opened a new pull request #3820:
URL: https://github.com/apache/hudi/pull/3820


   
   ## What is the purpose of the pull request
   To support multi-writer concurrency where one writer is the Deltastreamer, 
the other writers must copy any checkpoint state from previous commits into 
their current one so that interleaved commits do not crash the Deltastreamer. 
   
   Before this change, the code responsible for this did the following:
   * Get all the keys for the *current* inflight commit metadata
   * Filter out any keys that are not specified in the metadata config (a list 
of keys to replace)
   * For keys that exist in the current metadata, pull the data for that key 
from the *previous* commit and replace the current commit's metadata property 
value with that value
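
   The steps above can be sketched roughly as follows. This is an illustrative 
stand-in, not the actual Hudi code: the class name, method name, the `writer` 
metadata entry, and the checkpoint value are all made up, and the null guard 
on the previous commit is an assumption. It shows why a writer whose commit 
lacks the checkpoint key never picks it up.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Illustrative stand-in for the pre-change logic; not the actual Hudi code.
final class PreChangeCarryOverSketch {

    // Iterates over the CURRENT commit's keys, keeps those listed in the
    // config, and overwrites their values with the previous commit's values.
    static Map<String, String> overrideUsingCurrentKeys(
            Map<String, String> current,
            Map<String, String> previous,
            Set<String> configuredKeys) {
        Map<String, String> result = new HashMap<>(current);
        for (String key : current.keySet()) {
            // The containsKey guard on `previous` is an assumption of this sketch.
            if (configuredKeys.contains(key) && previous.containsKey(key)) {
                result.put(key, previous.get(key));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String checkpointKey = "deltastreamer.checkpoint.key";
        Set<String> configuredKeys = Collections.singleton(checkpointKey);

        // Previous commit written by the deltastreamer (made-up checkpoint value).
        Map<String, String> previous = new HashMap<>();
        previous.put(checkpointKey, "example-checkpoint");

        // Current commit from a Spark datasource writer: no checkpoint key.
        Map<String, String> current = new HashMap<>();
        current.put("writer", "spark-datasource");

        Map<String, String> merged =
                overrideUsingCurrentKeys(current, previous, configuredKeys);
        // The checkpoint key is never visited because it is not in `current`.
        System.out.println(merged.containsKey(checkpointKey)); // prints "false"
    }
}
```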
   
   This does not work because a non-Deltastreamer writer (such as a Spark 
datasource writer) never has the checkpoint key 
(`deltastreamer.checkpoint.key`) in its commit metadata, so nothing is copied 
and the timeline ends up with a commit that has no checkpoint state. If the 
Deltastreamer tries to start from that commit, it fails.
   
   This PR fixes that by taking the keyset to filter from the *previous* 
commit's metadata instead of the *current* commit's. This addresses two issues:
   1. Checkpoint state is copied over from a previous commit which was made by 
the deltastreamer
   2. If the deltastreamer process fails or is stopped for a prolonged period 
of time, the non-deltastreamer writers will continue to carry over the 
checkpoint state which will allow the deltastreamer to correctly start from its 
last known position
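
   A minimal sketch of the changed behavior, under the same caveats: the 
class and method names and the checkpoint value are hypothetical, and this is 
not the actual body of `TransactionUtils::overrideWithLatestCommitMetadata`. 
The description does not say what happens when the current commit already has 
a value for a configured key; this sketch assumes the previous commit's value 
overwrites it.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Illustrative stand-in for the post-change logic; not the actual Hudi code.
final class PostChangeCarryOverSketch {

    // Iterates over the PREVIOUS commit's keys, keeps those listed in the
    // config, and copies their values into the current commit's metadata.
    static Map<String, String> overrideUsingPreviousKeys(
            Map<String, String> current,
            Map<String, String> previous,
            Set<String> configuredKeys) {
        Map<String, String> result = new HashMap<>(current);
        for (String key : previous.keySet()) {
            if (configuredKeys.contains(key)) {
                // Assumption: the previous value wins over any current value.
                result.put(key, previous.get(key));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String checkpointKey = "deltastreamer.checkpoint.key";
        Set<String> configuredKeys = Collections.singleton(checkpointKey);

        // Last commit made by the deltastreamer (made-up checkpoint value).
        Map<String, String> deltastreamerCommit = new HashMap<>();
        deltastreamerCommit.put(checkpointKey, "example-checkpoint");

        // Two consecutive Spark datasource commits while the deltastreamer is down.
        Map<String, String> firstSparkCommit = overrideUsingPreviousKeys(
                new HashMap<>(), deltastreamerCommit, configuredKeys);
        Map<String, String> secondSparkCommit = overrideUsingPreviousKeys(
                new HashMap<>(), firstSparkCommit, configuredKeys);

        // The checkpoint is carried forward across both commits.
        System.out.println(secondSparkCommit.get(checkpointKey)); // prints "example-checkpoint"
    }
}
```

   Chaining two non-Deltastreamer commits, as in `main`, mirrors the second 
point: the checkpoint keeps propagating while the Deltastreamer is stopped.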
   
   
   ## Brief change log
   
     - Modify `TransactionUtils::overrideWithLatestCommitMetadata` to pull the 
keys from the last commit instead of the current commit
   
   ## Verify this pull request
   * Manually verified the change by running multiple writers against the same 
table:
     * Writer One: Deltastreamer, Kafka source
     * Writer Two: Spark datasource, event data from an existing Hudi table
   * Verified zero errors from the Deltastreamer over hundreds of interleaved 
commits
   * Shut down the Deltastreamer for a prolonged period, then verified that I 
could start it back up without losing its position in Kafka (checkpoint state 
intact on recent commits)
   
   ## Committer checklist
   
    - [x] Has a corresponding JIRA in PR title & commit
    
    - [x] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [x] Necessary doc changes done or have another open PR
          
   



