[GitHub] [hudi] nsivabalan commented on pull request #3485: [HUDI-2348] Added blog on "Schema evolution with DeltaStreamer using KafkaSource"

GitBox Wed, 25 Aug 2021 19:09:54 -0700


nsivabalan commented on pull request #3485:
URL: https://github.com/apache/hudi/pull/3485#issuecomment-906019219

I tried to update the patch myself, but it looks like I can't after the
website redesign.

Here are the changes. Could you apply them and update the patch?

```
diff --git a/website/blog/2021-08-16-schema-evolution.md b/website/blog/2021-08-16-kafka-custom-deserializer.md
similarity index 81%
rename from website/blog/2021-08-16-schema-evolution.md
rename to website/blog/2021-08-16-kafka-custom-deserializer.md
index b33c83755..302209990 100644
--- a/website/blog/2021-08-16-schema-evolution.md
+++ b/website/blog/2021-08-16-kafka-custom-deserializer.md
@@ -6,34 +6,42 @@ category: blog
---

The schema used for data exchange between services can change change
rapidly with new business requirements.
-Apache Hudi is often used in combination with kafka as a event stream where
all events are transmitted according to an record schema. In our case a
Confluent schema registry is used to maintain the schema and as schema evolves,
newer versions are updated in the schema registry.
+Apache Hudi is often used in combination with Kafka as an event stream where
all events are transmitted according to a record schema.
+In our case a Confluent schema registry is used to maintain the schema and
as the schema evolves, newer versions are registered in the schema registry.
+

-# What do we want to achieve?
+## What do we want to achieve?
We have multiple instances of DeltaStreamer running, consuming many topics
with different schemas ingesting to multiple Hudi tables. Deltastreamer is a
utility in Hudi to assist in ingesting data from multiple sources like DFS,
kafka, etc into Hudi. If interested, you can read more about DeltaStreamer tool
[here](https://hudi.apache.org/docs/writing_data#deltastreamer)
Ideally every Topic should be able to evolve the schema to match new
business requirements. Consumers start producing data with a new schema version
and the DeltaStreamer picks up the new schema and ingests the data with the new
schema. For this to work, we run our DeltaStreamer instances with the latest
schema version available from the Schema Registry to ensure that we always use
the freshest schema with all attributes.
A prerequisites it that all the mentioned Schema evolutions must be
`BACKWARD_TRANSITIVE` compatible (see [Schema Evolution and Compatibility of
Avro Schema
changes](https://docs.confluent.io/platform/current/schema-registry/avro.html).
This ensures that every record in the kafka topic can always be read using the
latest schema.

-# What is the problem?
+## What is the problem?
The normal operation looks like this. Multiple (or a single) producers
write records to the kafka topic.
In regular flow of events, all records are in the same schema v1 and is in
sync with schema registry.
-![Normal
operation](/assets/images/blog/hudi-schema-evolution/normal_operation.png) 
+![Normal
operation](/assets/images/blog/hudi-schema-evolution/normal_operation.png) 
Things get complicated when a producer switches to a new Writer-Schema v2
(in this case `Producer A`). `Producer B` remains on Schema v1. E.g. a
attribute `myattribute` was added to the schema, resulting in schema version v2.
Deltastreamer is capable of handling such schema evolution, if all incoming
records were evolved and serialized with evolved schema. But the complication
is that, some records are serialized with schema version v1 and some are
serialized with schema version v2.

-![Schema
evolution](/assets/images/blog/hudi-schema-evolution/schema_evolution.png) 
+![Schema
evolution](/assets/images/blog/hudi-schema-evolution/schema_evolution.png) 
The default deserializer used by Hudi
`io.confluent.kafka.serializers.KafkaAvroDeserializer` uses the schema that the
record was serialized with for deserialization. This causes Hudi to get records
with multiple different schema from the kafka client. E.g. Event #13 has the
new attribute `myattribute`, Event #14 dont has the new attribute
`myattribute`. This makes things complicated and error-prone for Hudi.

-![Confluent
Deserializer](/assets/images/blog/hudi-schema-evolution/confluent_deserializer.png) 
+![Confluent
Deserializer](/assets/images/blog/hudi-schema-evolution/confluent_deserializer.png)

-# Solution
-Hudi added a new custom Deserializer `KafkaAvroSchemaDeserializer` to solve
this problem of different producers producing records in different schema
versions, but to use the latest schema from schema registry to deserialize all
the records. 
+## Solution
+Hudi added a new custom Deserializer `KafkaAvroSchemaDeserializer` to solve
this problem of different producers producing records in different schema
versions. It uses the latest schema from the schema registry to deserialize all
the records.
As first step the Deserializer gets the latest schema from the Hudi
SchemaProvider. The SchemaProvider can get the schema for example from a
Confluent Schema-Registry or a file.
The Deserializer then reads the records from the topic with the schema the
record was written. As next step it will convert all the records to the latest
schema from the SchemaProvider, in our case the latest schema. As a result, the
kafka client will return all records with a unified schema i.e. the latest
schema as per schema registry. Hudi does not need to handle different schemas
inside a single batch.

-![KafkaAvroSchemaDeserializer](/assets/images/blog/hudi-schema-evolution/KafkaAvroSchemaDeserializer.png)

+![KafkaAvroSchemaDeserializer](/assets/images/blog/hudi-schema-evolution/KafkaAvroSchemaDeserializer.png)

-# How to use this solution
-As of upcoming release 0.9.0, normal Confluent Deserializer is used by
default. One has to explicitly set KafkaAvroSchemaDeserializer as below, in
order to ensure smooth schema evolution with different producers producing
records in different versions.
+## Configurations
+As of the upcoming release 0.9.0, the normal Confluent Deserializer is used by
default. One has to explicitly set KafkaAvroSchemaDeserializer as below,
+in order to ensure smooth schema evolution with different producers
producing records in different versions.

`hoodie.deltastreamer.source.kafka.value.deserializer.class=org.apache.hudi.utilities.deser.KafkaAvroSchemaDeserializer`
+
+## Conclusion
+We hope this blog helps in ingesting data from Kafka into Hudi using the
Deltastreamer tool, catering to different schema evolution
+needs. Hudi has a very active development community and we look forward to
more contributions.
+Please check out [this](https://hudi.apache.org/contribute/get-involved)
link to start contributing.
\ No newline at end of file
```
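
If it helps readers of the Solution section, the idea of returning every record in the latest schema can be illustrated with a small toy sketch. This is not the `KafkaAvroSchemaDeserializer` implementation and does not use Avro or Hudi APIs; records are plain dicts, and `REQUIRED`, `LATEST_SCHEMA`, `project_to_schema` and `unify_batch` are hypothetical names made up for the illustration. It assumes the new attribute carries a default, as backward-compatible evolution requires.

```python
from typing import Any, Dict, List

# Sentinel for fields without a default (hypothetical, not an Avro or Hudi API).
REQUIRED = object()

# Toy stand-in for schema v2 from the blog: field name -> default value.
# `myattribute` was added in v2 with a null default, so v1 records stay readable.
LATEST_SCHEMA: Dict[str, Any] = {"id": REQUIRED, "myattribute": None}


def project_to_schema(record: Dict[str, Any], target: Dict[str, Any]) -> Dict[str, Any]:
    """Rewrite one record so that it has exactly the fields of the target schema."""
    projected: Dict[str, Any] = {}
    for field, default in target.items():
        if field in record:
            # Field exists in the record as written: keep its value.
            projected[field] = record[field]
        elif default is REQUIRED:
            # No value and no default: the schemas are not compatible.
            raise ValueError(f"record is missing required field {field!r}")
        else:
            # Field was added later: fill in the schema default.
            projected[field] = default
    return projected


def unify_batch(records: List[Dict[str, Any]], latest: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the batch with every record projected to the latest schema."""
    return [project_to_schema(r, latest) for r in records]


# Event 13 was written with v2 (has `myattribute`), event 14 with v1 (does not).
batch = [{"id": 13, "myattribute": "x"}, {"id": 14}]
unified = unify_batch(batch, LATEST_SCHEMA)
```

After `unify_batch`, both events expose the same set of fields, which is the property the custom deserializer gives Hudi for a mixed batch.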
