n3nash commented on pull request #2374: URL: https://github.com/apache/hudi/pull/2374#issuecomment-798875296
@vinothchandar It's possible to allow backfills using spark-sql, but there are some corner cases. Consider the following:

1. Ingestion job running with commit c4 (checkpoint = c3).
2. Ingestion job finishes with commit c4 (checkpoint = c3).
3. Someone runs a spark-sql job to backfill some data in an older partition, commit c5. Since this spark-sql job (unlike deltastreamer) does not handle copying the checkpoint from the previous commit metadata to the next, it would be the client's job to do so.
4. If they fail to do this, the next deltastreamer ingestion, c6, will read no checkpoint from c5.

I've made the following changes:

1. To make this manageable, I've added a new config: `hoodie.write.meta.key.prefixes`. When this config is set, then during the critical section all metadata entries whose keys match the prefixes in this config are copied over from the latest commit metadata to the current commit.
2. Added these multi-writer tests to `HoodieDeltaStreamer` as well.

NOTE: The test may be failing because of some thread.sleep related code that I'm trying to remove. Will update tomorrow.

----------------------------------------------------------------
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
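For readers following along, the prefix-based metadata carry-over described above can be sketched roughly as below. This is a minimal, hypothetical illustration, not Hudi's actual implementation: the class and method names are made up, and the metadata key `deltastreamer.checkpoint.key` is used only as an example of a key a `deltastreamer.checkpoint` prefix would match.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of copying metadata entries whose keys match the
// configured prefixes (hoodie.write.meta.key.prefixes) from the latest
// commit's metadata into the current commit's metadata.
public class MetadataCarryOver {

    /**
     * Returns the current commit metadata, augmented with every entry from
     * the latest commit metadata whose key starts with one of the configured
     * prefixes. Keys already set by the current writer are not overwritten,
     * so a backfill writer keeps its own values but never drops the
     * ingestion checkpoint.
     */
    public static Map<String, String> carryOverMetadata(
            Map<String, String> latestCommitMetadata,
            Map<String, String> currentCommitMetadata,
            List<String> keyPrefixes) {
        Map<String, String> merged = new HashMap<>(currentCommitMetadata);
        for (Map.Entry<String, String> entry : latestCommitMetadata.entrySet()) {
            for (String prefix : keyPrefixes) {
                if (entry.getKey().startsWith(prefix)
                        && !merged.containsKey(entry.getKey())) {
                    merged.put(entry.getKey(), entry.getValue());
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Latest commit (c4) carries the deltastreamer checkpoint.
        Map<String, String> latest = new HashMap<>();
        latest.put("deltastreamer.checkpoint.key", "c3");
        latest.put("unrelated.key", "x");

        // Backfill commit (c5) starts with empty extra metadata.
        Map<String, String> current = new HashMap<>();

        Map<String, String> merged = carryOverMetadata(
                latest, current, List.of("deltastreamer.checkpoint"));
        System.out.println(merged.get("deltastreamer.checkpoint.key")); // c3
        System.out.println(merged.containsKey("unrelated.key"));        // false
    }
}
```

With this, the backfill commit c5 preserves the checkpoint, so the next deltastreamer run (c6) still finds checkpoint c3 in the latest commit metadata.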
