MikeTipico opened a new issue, #5685: URL: https://github.com/apache/hudi/issues/5685
**Problem Description**

In our systems we have scenarios that require us to reload some data (sent to us backdated, or corrected at the source) into existing Hudi tables. However, we are facing an issue when doing this, since some of the old data was sent to us using an older version of the schema. Could you kindly guide us on how best to approach this scenario, so that the data can be reloaded into the Hudi tables completely and correctly?

**Reproduce**

1. Consume some old data with an older version of the schema
2. Consume new, fresh data with an evolved version of the schema
3. Re-consume the data from step 1

**Expected Behaviour**

Data replayed with an older version of the schema is also evolved to match the current schema, provided the schemas are at least compatible, so that we do not have to worry about which schema version the data was originally written with. In addition, it would be nice to be able to 'omit' any schema issues and use default values / null in case of a mismatch.

**Environment Description**

- Hudi 0.10.0
- Spark 3.1.1
- Running on EMR 6.3.1 (using non-Amazon jar files)

**Additional Context**

Data arrives in our system via Kafka in Avro format. It is then picked up, enriched with a schema from the registry, and saved to S3. A Spark Structured Streaming application then picks up the new files, does some simple cleanup, and persists the data to Hudi.
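One avenue worth checking is Hudi's schema-reconciliation write option (`hoodie.datasource.write.reconcile.schema`), which is intended to reconcile an incoming batch's schema with the table's latest schema on write; whether it covers this replay case in 0.10.0 would need verification. Independently of Hudi, the behaviour described under "Expected Behaviour" could be applied as a projection step in the streaming job before writing. The sketch below is illustrative only, assuming the current schema can be expressed as a field-to-default mapping; the names (`CURRENT_SCHEMA`, `reconcile`) are hypothetical and not part of Hudi's API:

```python
# Illustrative sketch (not Hudi API): project a record written with an older
# schema onto the current table schema, dropping unknown fields and filling
# missing ones with a default (or None).

CURRENT_SCHEMA = {
    # field name -> default used when the old record lacks the field
    "id": None,
    "amount": 0.0,
    "currency": "EUR",
}

def reconcile(record, schema=CURRENT_SCHEMA):
    """Return a copy of `record` conforming to `schema`.

    Fields absent from `schema` are dropped; fields absent from `record`
    receive the schema default.
    """
    return {field: record.get(field, default) for field, default in schema.items()}

# A record produced under the older schema, before `currency` existed:
old_record = {"id": 42, "amount": 9.99}
print(reconcile(old_record))  # → {'id': 42, 'amount': 9.99, 'currency': 'EUR'}
```

In a Spark Structured Streaming job, an equivalent projection could be done with `withColumn`/`lit` defaults against the current schema before the Hudi write, so replayed old-schema files never reach the writer with a narrower schema.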
