n3nash commented on pull request #2374:
URL: https://github.com/apache/hudi/pull/2374#issuecomment-798875296


   @vinothchandar It's possible to allow backfills using spark-sql, but there 
are some corner cases. Consider the following:
   
   1. Ingestion job running with commit c4 (checkpoint = c3)
   2. Ingestion job finishes with commit c4 (checkpoint = c3)
   3. Someone runs a spark-sql job to backfill some data in an older partition, 
producing commit c5. Since this spark-sql job (unlike the deltastreamer) does not 
copy the checkpoint from the previous commit metadata into the next, it becomes 
the client's job to do so. 
   4. If they fail to do this, the deltastreamer's next ingestion c6 will read no 
checkpoint from c5. 
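   To make the failure mode concrete, here is a minimal Java sketch (the class, 
method, and key names are hypothetical, not Hudi's actual API) of a naive 
checkpoint lookup that only inspects the latest commit's metadata, and therefore 
loses the checkpoint once a backfill commit lands without it:

   ```java
   import java.util.*;

   public class CheckpointLookup {
       // Hypothetical key under which the ingestion job stores its checkpoint.
       static final String CHECKPOINT_KEY = "deltastreamer.checkpoint.key";

       /**
        * Returns the checkpoint recorded in the latest commit's metadata, if any.
        * A reader that only looks at the most recent commit finds nothing when
        * the backfill commit (c5) did not copy the checkpoint forward.
        */
       static Optional<String> latestCheckpoint(List<Map<String, String>> commitsInOrder) {
           if (commitsInOrder.isEmpty()) {
               return Optional.empty();
           }
           Map<String, String> latest = commitsInOrder.get(commitsInOrder.size() - 1);
           return Optional.ofNullable(latest.get(CHECKPOINT_KEY));
       }

       public static void main(String[] args) {
           Map<String, String> c4 = new HashMap<>();
           c4.put(CHECKPOINT_KEY, "c3"); // ingestion commit carries its checkpoint
           Map<String, String> c5 = new HashMap<>(); // backfill commit: checkpoint not copied

           // After c4 alone, the checkpoint is visible.
           System.out.println(latestCheckpoint(Arrays.asList(c4))); // Optional[c3]
           // After the backfill c5, the lookup finds no checkpoint.
           System.out.println(latestCheckpoint(Arrays.asList(c4, c5))); // Optional.empty
       }
   }
   ```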
   
   I've made the following changes:
   
   1) To make this manageable, I've added a new config: 
`hoodie.write.meta.key.prefixes`. When this config is set, the writer will, 
during the critical section, copy all metadata entries whose keys match one of 
the prefixes set in this config from the latest commit's metadata to the current 
commit.
   2) Added these multi-writer tests to `HoodieDeltaStreamer` as well. 
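   As a rough illustration of what that copy step does (a sketch under my own 
naming, not the actual Hudi implementation; the method and variable names are 
made up):

   ```java
   import java.util.HashMap;
   import java.util.Map;

   public class MetaKeyPrefixCopy {
       /**
        * Copies every entry from the latest commit's metadata whose key starts
        * with one of the configured comma-separated prefixes into the current
        * commit's metadata. Mirrors the intent of `hoodie.write.meta.key.prefixes`.
        */
       static Map<String, String> copyMatchingMeta(String prefixesConfig,
                                                   Map<String, String> latestMeta,
                                                   Map<String, String> currentMeta) {
           Map<String, String> result = new HashMap<>(currentMeta);
           if (prefixesConfig == null || prefixesConfig.isEmpty()) {
               return result; // config unset: nothing is carried forward
           }
           String[] prefixes = prefixesConfig.split(",");
           for (Map.Entry<String, String> e : latestMeta.entrySet()) {
               for (String prefix : prefixes) {
                   if (e.getKey().startsWith(prefix.trim())) {
                       result.put(e.getKey(), e.getValue());
                       break;
                   }
               }
           }
           return result;
       }
   }
   ```

   With the config set to a prefix such as `deltastreamer.checkpoint`, the 
checkpoint entry from c4's metadata would be carried into c5 even though the 
spark-sql writer itself knows nothing about checkpoints.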
   
   NOTE: The test may be failing because of some `Thread.sleep`-related code 
that I'm trying to remove. Will update tomorrow.  
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
