MikeTipico opened a new issue, #5685: URL: https://github.com/apache/hudi/issues/5685
**Problem Description**

In our systems we have scenarios that require us to reload some data (sent to us backdated, or corrected at the source) into existing Hudi tables. However, we are facing an issue when doing this, since some of the old data was sent to us using an older version of the schema. Could you kindly guide us on how best to approach this scenario, so that the data can be reloaded into the Hudi tables completely and correctly?

**Reproduce**

1. Consume some old data with an older version of the schema
2. Consume new, fresh data with an evolved version of the schema
3. Re-consume the data from step 1

**Expected Behaviour**

Data replayed with an older version of the schema is also evolved to match the current schema, provided the schemas are at least compatible, so that we do not have to worry about which schema version the data was originally written with. In addition, it would be nice to be able to 'omit' any schema issues and use default values / null in case of a mismatch.

**Environment Description**

- Hudi 0.10.0
- Spark 3.1.1
- Running on EMR 6.3.1 (using non-Amazon jar files)

**Additional Context**

Data arrives in our system via Kafka in Avro format. It is then picked up, enriched with a schema from the registry, and saved to S3. A Spark Structured Streaming application then picks up the new files, does some simple cleanup, and persists the data to Hudi.
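One avenue worth checking is Hudi's schema-reconciliation write option (`hoodie.datasource.write.reconcile.schema`), which is intended to reconcile an incoming batch's schema with the table's latest schema on write; whether it covers this replay case in 0.10.0 would need verification. Independently of Hudi, the behaviour described under "Expected Behaviour" could be applied as a projection step in the streaming job before writing. The sketch below is illustrative only, assuming the current schema can be expressed as a field-to-default mapping; the names (`CURRENT_SCHEMA`, `reconcile`) are hypothetical and not part of Hudi's API:

```python
# Illustrative sketch (not Hudi API): project a record written with an older
# schema onto the current table schema, dropping unknown fields and filling
# missing ones with a default (or None).

CURRENT_SCHEMA = {
    # field name -> default used when the old record lacks the field
    "id": None,
    "amount": 0.0,
    "currency": "EUR",
}

def reconcile(record, schema=CURRENT_SCHEMA):
    """Return a copy of `record` conforming to `schema`.

    Fields absent from `schema` are dropped; fields absent from `record`
    receive the schema default.
    """
    return {field: record.get(field, default) for field, default in schema.items()}

# A record produced under the older schema, before `currency` existed:
old_record = {"id": 42, "amount": 9.99}
print(reconcile(old_record))  # → {'id': 42, 'amount': 9.99, 'currency': 'EUR'}
```

In a Spark Structured Streaming job, an equivalent projection could be done with `withColumn`/`lit` defaults against the current schema before the Hudi write, so replayed old-schema files never reach the writer with a narrower schema.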
