envomp opened a new pull request, #8738:
URL: https://github.com/apache/hudi/pull/8738
Our current flow is as follows:
fetch schema:
- Fetch the desired table schema in Avro format from the schema registry
- Derive the corresponding dataset schema from the desired table Avro schema
- Convert that dataset schema back to an Avro schema to obtain a unified
schema
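The schema-unification step above can be sketched with a toy model. This is not the actual `HoodieAvroUtils` logic; the `FieldType` enum and `unify` helper are hypothetical stand-ins for Avro types, used only to illustrate dropping fields absent from the table schema and widening enum to string:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SchemaUnifySketch {
    // Hypothetical stand-in for Avro field types.
    enum FieldType { INT, LONG, STRING, ENUM }

    // Keep only the fields the desired table schema declares, and widen
    // ENUM fields to STRING so downstream consumers see a stable datatype.
    static Map<String, FieldType> unify(Map<String, FieldType> desired,
                                        Map<String, FieldType> dataset) {
        Map<String, FieldType> unified = new LinkedHashMap<>();
        for (Map.Entry<String, FieldType> e : desired.entrySet()) {
            FieldType t = dataset.getOrDefault(e.getKey(), e.getValue());
            unified.put(e.getKey(), t == FieldType.ENUM ? FieldType.STRING : t);
        }
        return unified;
    }

    public static void main(String[] args) {
        Map<String, FieldType> desired = new LinkedHashMap<>();
        desired.put("id", FieldType.LONG);
        desired.put("status", FieldType.STRING);

        Map<String, FieldType> dataset = new LinkedHashMap<>();
        dataset.put("id", FieldType.LONG);
        dataset.put("status", FieldType.ENUM);   // enum in the source data
        dataset.put("debug", FieldType.STRING);  // not in table schema: dropped

        System.out.println(unify(desired, dataset));
        // {id=LONG, status=STRING}
    }
}
```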
transform input:
- Consume Kafka via Spark Streaming and receive an `RDD<GenericRecord>`
- Rewrite the RDD to the unified schema to resolve version differences and
end up with the desired schema, where some fields are dropped and some
datatypes are changed (enum => string, for example)
- We use a copy of the `org.apache.hudi.avro.HoodieAvroUtils` class to do
the rewrite and need it to support the aforementioned rewrite procedure
- Hopefully we can land these changes in the public repo, so we don't need
to maintain a custom rewrite class
- Convert the `RDD<GenericRecord>` to a `Dataset<Row>`
There are also situations where we read input from S3 directly, resulting in a
`Dataset<Row>` that needs to be backfilled into the table, so maintaining
control over the schema on the writer side is beneficial for us.
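The record-rewrite step can likewise be sketched without Avro. The `rewrite` helper below is hypothetical, not the actual `HoodieAvroUtils` code; it illustrates the requested behavior on a record modeled as a plain map: drop fields outside the unified schema and convert enum values to their string symbol:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RecordRewriteSketch {
    // Hypothetical stand-in for an Avro enum symbol in an incoming record.
    enum Status { ACTIVE, DELETED }

    // Rewrite one record to the unified schema: keep only the listed fields
    // and stringify enum values (the enum => string conversion this PR asks
    // the rewrite utility to support).
    static Map<String, Object> rewrite(Map<String, Object> record,
                                       List<String> unifiedFields) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (String field : unifiedFields) {
            Object v = record.get(field);
            out.put(field, v instanceof Enum ? ((Enum<?>) v).name() : v);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("id", 42L);
        record.put("status", Status.ACTIVE);
        record.put("internal", "drop-me"); // not part of the table schema

        System.out.println(rewrite(record, List.of("id", "status")));
        // {id=42, status=ACTIVE}  -- status is now a String, not an enum
    }
}
```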
### Change Logs
Rewrite `HoodieAvroUtils` to support enum => string conversion.
### Impact
no impact
### Risk level (write none, low medium or high below)
none
### Documentation Update
no documentation needed
### Contributor's checklist
- [x] Read through [contributor's
guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Change Logs and Impact were stated clearly
- [x] Adequate tests were added if applicable
- [ ] CI passed
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]