n3nash edited a comment on pull request #2374: URL: https://github.com/apache/hudi/pull/2374#issuecomment-798875296
@vinothchandar It's possible to allow backfills using spark-sql, but there are some corner cases. Consider the following:

1. Ingestion job is running with commit c4 (checkpoint = c3).
2. Ingestion job finishes with commit c4 (checkpoint = c3).
3. Someone runs a spark-sql job to backfill some data in an older partition, commit c5. Since this spark-sql job (unlike deltastreamer) does not handle copying the checkpoint from the previous commit's metadata to the next, it would be the client's job to do this.
4. If they fail to do this, the next deltastreamer ingestion, c6, will read no checkpoint from c5.

I've made the following changes:

1) To make this manageable, I've added the following config: `hoodie.write.meta.key.prefixes`. When this config is set, then during the critical section the writer copies all metadata whose keys match the prefix set in this config from the latest commit's metadata to the current commit.
2) Made changes and added these multi-writer tests to `HoodieDeltaStreamer` as well.

Technically, one can do the backfill using either `HoodieDeltaStreamer` or `Spark-SQL`. With `HoodieDeltaStreamer`, they would have to set a custom checkpoint or set it to null to ensure that the job just picks up the data from the backfill location; for Spark-SQL it would not matter.

Yes, I am going to add documentation on best practices / things to watch out for in the other PR I opened for documentation. I will do that after resolving any further comments and landing this PR in the next couple of days.

NOTE: The test may be failing because of some thread.sleep-related code that I'm trying to remove. Will update tomorrow.
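To illustrate the prefix-based copy that `hoodie.write.meta.key.prefixes` is meant to do, here is a rough sketch in Python. It is a toy model under stated assumptions, not the actual Hudi code: `carry_over_metadata`, the dict-based commit metadata, and the `writer` entry are hypothetical, `deltastreamer.checkpoint.key` is assumed to be the key under which deltastreamer stores its checkpoint, and the choice to leave keys already set by the current writer untouched is also an assumption.

```python
def carry_over_metadata(latest_meta, current_meta, key_prefixes):
    """Toy model: copy entries whose key starts with any configured prefix
    from the latest commit's metadata into the current commit's metadata.

    key_prefixes is a comma-separated string, mimicking a config value.
    Keys the current writer already set are kept as-is (an assumption).
    """
    prefixes = [p.strip() for p in key_prefixes.split(",") if p.strip()]
    merged = dict(current_meta)  # do not mutate the caller's dict
    for key, value in latest_meta.items():
        if any(key.startswith(p) for p in prefixes):
            merged.setdefault(key, value)
    return merged


# c4: deltastreamer commit carrying checkpoint c3 (values are illustrative).
c4_meta = {"deltastreamer.checkpoint.key": "c3", "writer": "deltastreamer"}
# c5: spark-sql backfill commit, which writes no checkpoint of its own.
c5_meta = {"writer": "spark-sql"}

# With the prefix configured, c5 inherits the checkpoint, so c6 can read it.
c5_merged = carry_over_metadata(c4_meta, c5_meta, "deltastreamer.checkpoint")
print(c5_merged["deltastreamer.checkpoint.key"])
```

Without the prefix configured (an empty string here), c5 would carry no checkpoint, which is the failure described in step 4.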
