nsivabalan commented on a change in pull request #3485:
URL: https://github.com/apache/hudi/pull/3485#discussion_r692983264



##########
File path: website/blog/2021-08-16-schema-evolution.md
##########
@@ -0,0 +1,35 @@
+---
+title: "The schema evolution story of Hudi"
+excerpt: "Evolve schema to keep data up to date with business"

Review comment:
       can you please fix the excerpt also based on suggested title

##########
File path: website/blog/2021-08-16-schema-evolution.md
##########
@@ -0,0 +1,35 @@
+---
+title: "The schema evolution story of Hudi"

Review comment:
       I would prefer to keep the title bit specific. 
   "Schema evolution with DeltaStreamer using KafkaSource"

##########
File path: website/blog/2021-08-16-schema-evolution.md
##########
@@ -0,0 +1,35 @@
+---
+title: "The schema evolution story of Hudi"
+excerpt: "Evolve schema to keep data up to date with business"
+author: sbernauer
+category: blog
+---
+
+The schema used for data exchange between services can change change rapidly with new business requirements.
+Apache Hudi is often used in combination with kafka as a Event Stream where all events are transmitted according to an Record Schema. In our case a Confluent Schema Registry is used to maintain the Schemas and the different Version every Schemas can evolve.
+
+# What do we want to achieve?
+We have multiple instances of Deltastreamer running consuming many topics with different schemas.
+Ideally every Topic should be able to evolve the schema to match new business requirements. Consumers start producing data with a new schema version and the Deltastreamer picks up the new schema and ingests the data with the new schema. For this to work we run our Deltastreamer instances with the latest schema version available from the Schema Registry to ensure that we always use the freshest schema with all attributes.
+A prerequisites it that all the mentioned Schema evolutions must be `BACKWARD_TRANSITIVE` compatible (see [Schema Evolution and Compatibility of Avro Schema changes](https://docs.confluent.io/platform/current/schema-registry/avro.html). This ensures that every record in the kafka topic can always be read using the latest schema.
+
+
+# What is the problem?
+The normal operation looks like this. Multiple (or a single) producers write records to the kafka topic.
+![Normal operation](/assets/images/blog/hudi-schema-evolution/normal_operation.png)
+Things get complicated when a producer switches to a new Writer-Schema v2 (in this case `Producer A`). `Producer B` remains on Schema v1. E.g. a attribute `myattribute` was added to the schema, resulting in schema version v2.
+So Deltastreamer must not only be able to handle Events that suddenly have a new Schema but also parallel operation of different Schema versions.
+![Schema evolution](/assets/images/blog/hudi-schema-evolution/schema_evolution.png)
+The default deserializer used by Hudi `io.confluent.kafka.serializers.KafkaAvroDeserializer` uses the schema that that exact record was written with for deserialization. This causes Hudi to get records with multiple different schema from the kafka client. E.g. Event #13 has the new attribute `myattribute`, Event #14 dont has the new attribute `myattribute`. This makes things complicated and error-prone for Hudi.

Review comment:
       lets add a line break at the beginning just after image

##########
File path: website/blog/2021-08-16-schema-evolution.md
##########
@@ -0,0 +1,35 @@
+---
+title: "The schema evolution story of Hudi"
+excerpt: "Evolve schema to keep data up to date with business"
+author: sbernauer
+category: blog
+---
+
+The schema used for data exchange between services can change change rapidly with new business requirements.
+Apache Hudi is often used in combination with kafka as a Event Stream where all events are transmitted according to an Record Schema. In our case a Confluent Schema Registry is used to maintain the Schemas and the different Version every Schemas can evolve.
+
+# What do we want to achieve?
+We have multiple instances of Deltastreamer running consuming many topics with different schemas.
+Ideally every Topic should be able to evolve the schema to match new business requirements. Consumers start producing data with a new schema version and the Deltastreamer picks up the new schema and ingests the data with the new schema. For this to work we run our Deltastreamer instances with the latest schema version available from the Schema Registry to ensure that we always use the freshest schema with all attributes.
+A prerequisites it that all the mentioned Schema evolutions must be `BACKWARD_TRANSITIVE` compatible (see [Schema Evolution and Compatibility of Avro Schema changes](https://docs.confluent.io/platform/current/schema-registry/avro.html). This ensures that every record in the kafka topic can always be read using the latest schema.
+
+
+# What is the problem?
+The normal operation looks like this. Multiple (or a single) producers write records to the kafka topic.
+![Normal operation](/assets/images/blog/hudi-schema-evolution/normal_operation.png)
+Things get complicated when a producer switches to a new Writer-Schema v2 (in this case `Producer A`). `Producer B` remains on Schema v1. E.g. a attribute `myattribute` was added to the schema, resulting in schema version v2.
+So Deltastreamer must not only be able to handle Events that suddenly have a new Schema but also parallel operation of different Schema versions.
+![Schema evolution](/assets/images/blog/hudi-schema-evolution/schema_evolution.png)
+The default deserializer used by Hudi `io.confluent.kafka.serializers.KafkaAvroDeserializer` uses the schema that that exact record was written with for deserialization. This causes Hudi to get records with multiple different schema from the kafka client. E.g. Event #13 has the new attribute `myattribute`, Event #14 dont has the new attribute `myattribute`. This makes things complicated and error-prone for Hudi.
+
+![Confluent Deserializer](/assets/images/blog/hudi-schema-evolution/confluent_deserializer.png)
+
+# Solution
+We can use a custom Deserializer `KafkaAvroSchemaDeserializer` and plug it 
into the kafka client.
+As first step the Deserializer gets the source schema from the Hudi 
SchemaProvider. The SchemaProvider can get the schema for example from a 
Confluent Schema-Registry or a file.
+The Deserializer then reads the records from the topic with the schema the 
record was written. As next step it will convert all the records to the source 
schema from the SchemaProvider, in our case the latest schema. As a result, the 
kafka client will return all records with a unified schema. Hudi does not need 
to handle different schemas inside a single batch.

Review comment:
       lets add a line break at the end before the image. 

##########
File path: website/blog/2021-08-16-schema-evolution.md
##########
@@ -0,0 +1,35 @@
+---
+title: "The schema evolution story of Hudi"
+excerpt: "Evolve schema to keep data up to date with business"
+author: sbernauer
+category: blog
+---
+
+The schema used for data exchange between services can change change rapidly with new business requirements.
+Apache Hudi is often used in combination with kafka as a Event Stream where all events are transmitted according to an Record Schema. In our case a Confluent Schema Registry is used to maintain the Schemas and the different Version every Schemas can evolve.
+
+# What do we want to achieve?
+We have multiple instances of Deltastreamer running consuming many topics with different schemas.
+Ideally every Topic should be able to evolve the schema to match new business requirements. Consumers start producing data with a new schema version and the Deltastreamer picks up the new schema and ingests the data with the new schema. For this to work we run our Deltastreamer instances with the latest schema version available from the Schema Registry to ensure that we always use the freshest schema with all attributes.
+A prerequisites it that all the mentioned Schema evolutions must be `BACKWARD_TRANSITIVE` compatible (see [Schema Evolution and Compatibility of Avro Schema changes](https://docs.confluent.io/platform/current/schema-registry/avro.html). This ensures that every record in the kafka topic can always be read using the latest schema.
+
+
+# What is the problem?
+The normal operation looks like this. Multiple (or a single) producers write records to the kafka topic.
+![Normal operation](/assets/images/blog/hudi-schema-evolution/normal_operation.png)
+Things get complicated when a producer switches to a new Writer-Schema v2 (in this case `Producer A`). `Producer B` remains on Schema v1. E.g. a attribute `myattribute` was added to the schema, resulting in schema version v2.
+So Deltastreamer must not only be able to handle Events that suddenly have a new Schema but also parallel operation of different Schema versions.
+![Schema evolution](/assets/images/blog/hudi-schema-evolution/schema_evolution.png)
+The default deserializer used by Hudi `io.confluent.kafka.serializers.KafkaAvroDeserializer` uses the schema that that exact record was written with for deserialization. This causes Hudi to get records with multiple different schema from the kafka client. E.g. Event #13 has the new attribute `myattribute`, Event #14 dont has the new attribute `myattribute`. This makes things complicated and error-prone for Hudi.
+
+![Confluent Deserializer](/assets/images/blog/hudi-schema-evolution/confluent_deserializer.png)
+
+# Solution
+We can use a custom Deserializer `KafkaAvroSchemaDeserializer` and plug it into the kafka client.
+As first step the Deserializer gets the source schema from the Hudi SchemaProvider. The SchemaProvider can get the schema for example from a Confluent Schema-Registry or a file.
+The Deserializer then reads the records from the topic with the schema the record was written. As next step it will convert all the records to the source schema from the SchemaProvider, in our case the latest schema. As a result, the kafka client will return all records with a unified schema. Hudi does not need to handle different schemas inside a single batch.

Review comment:
       minor. 
   ```
   ... with a unified schema i.e. the latest schema as per schema registry."
   ```
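To make the suggested wording concrete: a toy sketch of what "a unified schema i.e. the latest schema as per schema registry" means for a batch. This is not Hudi's actual `KafkaAvroSchemaDeserializer` code; schemas are modeled here as plain dicts of field defaults purely for illustration.

```python
# Illustrative sketch (NOT Hudi's implementation): project records written
# with older schema versions onto the latest schema, so every record in a
# batch arrives with the same, unified set of fields.

SCHEMAS = {
    1: {"id": None, "name": None},                     # schema v1
    2: {"id": None, "name": None, "myattribute": ""},  # v2 adds `myattribute`
}
LATEST_VERSION = max(SCHEMAS)

def to_latest(record: dict) -> dict:
    """Fill fields missing from older-schema records with the defaults of the
    latest schema (BACKWARD_TRANSITIVE compatibility guarantees defaults exist)."""
    latest = SCHEMAS[LATEST_VERSION]
    return {field: record.get(field, default) for field, default in latest.items()}

batch = [
    {"id": 13, "name": "a", "myattribute": "x"},  # event #13, written with v2
    {"id": 14, "name": "b"},                      # event #14, written with v1
]
unified = [to_latest(record) for record in batch]
```

After projection, both records expose the same fields, which is the property the reviewer's wording emphasizes.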

##########
File path: website/blog/2021-08-16-schema-evolution.md
##########
@@ -0,0 +1,35 @@
+---
+title: "The schema evolution story of Hudi"
+excerpt: "Evolve schema to keep data up to date with business"
+author: sbernauer
+category: blog
+---
+
+The schema used for data exchange between services can change change rapidly with new business requirements.
+Apache Hudi is often used in combination with kafka as a Event Stream where all events are transmitted according to an Record Schema. In our case a Confluent Schema Registry is used to maintain the Schemas and the different Version every Schemas can evolve.
+
+# What do we want to achieve?
+We have multiple instances of Deltastreamer running consuming many topics with different schemas.
+Ideally every Topic should be able to evolve the schema to match new business requirements. Consumers start producing data with a new schema version and the Deltastreamer picks up the new schema and ingests the data with the new schema. For this to work we run our Deltastreamer instances with the latest schema version available from the Schema Registry to ensure that we always use the freshest schema with all attributes.
+A prerequisites it that all the mentioned Schema evolutions must be `BACKWARD_TRANSITIVE` compatible (see [Schema Evolution and Compatibility of Avro Schema changes](https://docs.confluent.io/platform/current/schema-registry/avro.html). This ensures that every record in the kafka topic can always be read using the latest schema.
+
+
+# What is the problem?
+The normal operation looks like this. Multiple (or a single) producers write records to the kafka topic.
+![Normal operation](/assets/images/blog/hudi-schema-evolution/normal_operation.png)
+Things get complicated when a producer switches to a new Writer-Schema v2 (in this case `Producer A`). `Producer B` remains on Schema v1. E.g. a attribute `myattribute` was added to the schema, resulting in schema version v2.
+So Deltastreamer must not only be able to handle Events that suddenly have a new Schema but also parallel operation of different Schema versions.
+![Schema evolution](/assets/images/blog/hudi-schema-evolution/schema_evolution.png)
+The default deserializer used by Hudi `io.confluent.kafka.serializers.KafkaAvroDeserializer` uses the schema that that exact record was written with for deserialization. This causes Hudi to get records with multiple different schema from the kafka client. E.g. Event #13 has the new attribute `myattribute`, Event #14 dont has the new attribute `myattribute`. This makes things complicated and error-prone for Hudi.
+
+![Confluent Deserializer](/assets/images/blog/hudi-schema-evolution/confluent_deserializer.png)
+
+# Solution
+We can use a custom Deserializer `KafkaAvroSchemaDeserializer` and plug it into the kafka client.
+As first step the Deserializer gets the source schema from the Hudi SchemaProvider. The SchemaProvider can get the schema for example from a Confluent Schema-Registry or a file.

Review comment:
       can we replace "source schema" to "latest schema". Not sure if readers will understand the nitty gritty details that in schema provider we have source and target schema etc. For the context of this blog, we can assume schema provider has only one schema and so we can avoid talking about "source" schema. 
   Can you please fix all such places to be consistent with this. 

##########
File path: website/blog/2021-08-16-schema-evolution.md
##########
@@ -0,0 +1,35 @@
+---
+title: "The schema evolution story of Hudi"
+excerpt: "Evolve schema to keep data up to date with business"
+author: sbernauer
+category: blog
+---
+
+The schema used for data exchange between services can change change rapidly with new business requirements.
+Apache Hudi is often used in combination with kafka as a Event Stream where all events are transmitted according to an Record Schema. In our case a Confluent Schema Registry is used to maintain the Schemas and the different Version every Schemas can evolve.
+
+# What do we want to achieve?
+We have multiple instances of Deltastreamer running consuming many topics with different schemas.
+Ideally every Topic should be able to evolve the schema to match new business requirements. Consumers start producing data with a new schema version and the Deltastreamer picks up the new schema and ingests the data with the new schema. For this to work we run our Deltastreamer instances with the latest schema version available from the Schema Registry to ensure that we always use the freshest schema with all attributes.
+A prerequisites it that all the mentioned Schema evolutions must be `BACKWARD_TRANSITIVE` compatible (see [Schema Evolution and Compatibility of Avro Schema changes](https://docs.confluent.io/platform/current/schema-registry/avro.html). This ensures that every record in the kafka topic can always be read using the latest schema.
+
+
+# What is the problem?
+The normal operation looks like this. Multiple (or a single) producers write records to the kafka topic.
+![Normal operation](/assets/images/blog/hudi-schema-evolution/normal_operation.png)
+Things get complicated when a producer switches to a new Writer-Schema v2 (in this case `Producer A`). `Producer B` remains on Schema v1. E.g. a attribute `myattribute` was added to the schema, resulting in schema version v2.
+So Deltastreamer must not only be able to handle Events that suddenly have a new Schema but also parallel operation of different Schema versions.
+![Schema evolution](/assets/images/blog/hudi-schema-evolution/schema_evolution.png)
+The default deserializer used by Hudi `io.confluent.kafka.serializers.KafkaAvroDeserializer` uses the schema that that exact record was written with for deserialization. This causes Hudi to get records with multiple different schema from the kafka client. E.g. Event #13 has the new attribute `myattribute`, Event #14 dont has the new attribute `myattribute`. This makes things complicated and error-prone for Hudi.

Review comment:
       minor. 
   ```
   ... uses the schema that the record was serialized with" 
   ```
##########
File path: website/blog/2021-08-16-schema-evolution.md
##########
@@ -0,0 +1,35 @@
+---
+title: "The schema evolution story of Hudi"
+excerpt: "Evolve schema to keep data up to date with business"
+author: sbernauer
+category: blog
+---
+
+The schema used for data exchange between services can change change rapidly with new business requirements.
+Apache Hudi is often used in combination with kafka as a Event Stream where all events are transmitted according to an Record Schema. In our case a Confluent Schema Registry is used to maintain the Schemas and the different Version every Schemas can evolve.
+
+# What do we want to achieve?
+We have multiple instances of Deltastreamer running consuming many topics with different schemas.

Review comment:
       minor. Some rephrasing and also lets add a one liner about Deltastreamer just incase someone is not aware of it. 
   
   ```
   "We have multiple instances of Deltastreamer running, consuming many topics with different schemas ingesting to the same hudi table. Deltastreamer is a utility in Hudi to assist in ingesting data from multiple sources like DFS, kafka, etc into Hudi. If interested, you can read more about DeltaStreamer tool [here](https://hudi.apache.org/docs/writing_data#deltastreamer)".
   ```

##########
File path: website/blog/2021-08-16-schema-evolution.md
##########
@@ -0,0 +1,35 @@
+---
+title: "The schema evolution story of Hudi"
+excerpt: "Evolve schema to keep data up to date with business"
+author: sbernauer
+category: blog
+---
+
+The schema used for data exchange between services can change change rapidly with new business requirements.
+Apache Hudi is often used in combination with kafka as a Event Stream where all events are transmitted according to an Record Schema. In our case a Confluent Schema Registry is used to maintain the Schemas and the different Version every Schemas can evolve.
+
+# What do we want to achieve?
+We have multiple instances of Deltastreamer running consuming many topics with different schemas.
+Ideally every Topic should be able to evolve the schema to match new business requirements. Consumers start producing data with a new schema version and the Deltastreamer picks up the new schema and ingests the data with the new schema. For this to work we run our Deltastreamer instances with the latest schema version available from the Schema Registry to ensure that we always use the freshest schema with all attributes.
+A prerequisites it that all the mentioned Schema evolutions must be `BACKWARD_TRANSITIVE` compatible (see [Schema Evolution and Compatibility of Avro Schema changes](https://docs.confluent.io/platform/current/schema-registry/avro.html). This ensures that every record in the kafka topic can always be read using the latest schema.
+
+
+# What is the problem?
+The normal operation looks like this. Multiple (or a single) producers write records to the kafka topic.
Review comment:
       ```
   .... to the kafka topic. In regular flow of events, all records are in the same schema V1 and is in sync with schema registry. "
   ```
   

##########
File path: website/blog/2021-08-16-schema-evolution.md
##########
@@ -0,0 +1,35 @@
+---
+title: "The schema evolution story of Hudi"
+excerpt: "Evolve schema to keep data up to date with business"
+author: sbernauer
+category: blog
+---
+
+The schema used for data exchange between services can change change rapidly with new business requirements.
+Apache Hudi is often used in combination with kafka as a Event Stream where all events are transmitted according to an Record Schema. In our case a Confluent Schema Registry is used to maintain the Schemas and the different Version every Schemas can evolve.
+
+# What do we want to achieve?
+We have multiple instances of Deltastreamer running consuming many topics with different schemas.
+Ideally every Topic should be able to evolve the schema to match new business requirements. Consumers start producing data with a new schema version and the Deltastreamer picks up the new schema and ingests the data with the new schema. For this to work we run our Deltastreamer instances with the latest schema version available from the Schema Registry to ensure that we always use the freshest schema with all attributes.
+A prerequisites it that all the mentioned Schema evolutions must be `BACKWARD_TRANSITIVE` compatible (see [Schema Evolution and Compatibility of Avro Schema changes](https://docs.confluent.io/platform/current/schema-registry/avro.html). This ensures that every record in the kafka topic can always be read using the latest schema.
+
+
+# What is the problem?
+The normal operation looks like this. Multiple (or a single) producers write records to the kafka topic.
+![Normal operation](/assets/images/blog/hudi-schema-evolution/normal_operation.png)
+Things get complicated when a producer switches to a new Writer-Schema v2 (in this case `Producer A`). `Producer B` remains on Schema v1. E.g. a attribute `myattribute` was added to the schema, resulting in schema version v2.
+So Deltastreamer must not only be able to handle Events that suddenly have a new Schema but also parallel operation of different Schema versions.
+![Schema evolution](/assets/images/blog/hudi-schema-evolution/schema_evolution.png)
+The default deserializer used by Hudi `io.confluent.kafka.serializers.KafkaAvroDeserializer` uses the schema that that exact record was written with for deserialization. This causes Hudi to get records with multiple different schema from the kafka client. E.g. Event #13 has the new attribute `myattribute`, Event #14 dont has the new attribute `myattribute`. This makes things complicated and error-prone for Hudi.

Review comment:
       minor: remove extra "that".
   "the schema that that exact"

##########
File path: website/blog/2021-08-16-schema-evolution.md
##########
@@ -0,0 +1,35 @@
+---
+title: "The schema evolution story of Hudi"
+excerpt: "Evolve schema to keep data up to date with business"
+author: sbernauer
+category: blog
+---
+
+The schema used for data exchange between services can change change rapidly with new business requirements.
+Apache Hudi is often used in combination with kafka as a Event Stream where all events are transmitted according to an Record Schema. In our case a Confluent Schema Registry is used to maintain the Schemas and the different Version every Schemas can evolve.
+
+# What do we want to achieve?
+We have multiple instances of Deltastreamer running consuming many topics with different schemas.
+Ideally every Topic should be able to evolve the schema to match new business requirements. Consumers start producing data with a new schema version and the Deltastreamer picks up the new schema and ingests the data with the new schema. For this to work we run our Deltastreamer instances with the latest schema version available from the Schema Registry to ensure that we always use the freshest schema with all attributes.
+A prerequisites it that all the mentioned Schema evolutions must be `BACKWARD_TRANSITIVE` compatible (see [Schema Evolution and Compatibility of Avro Schema changes](https://docs.confluent.io/platform/current/schema-registry/avro.html). This ensures that every record in the kafka topic can always be read using the latest schema.
+
+
+# What is the problem?
+The normal operation looks like this. Multiple (or a single) producers write records to the kafka topic.

Review comment:
       line break at the end. 
   ```
   ...to the kafka topic. <br/>
   ```

##########
File path: website/blog/2021-08-16-schema-evolution.md
##########
@@ -0,0 +1,35 @@
+---
+title: "The schema evolution story of Hudi"
+excerpt: "Evolve schema to keep data up to date with business"
+author: sbernauer
+category: blog
+---
+
+The schema used for data exchange between services can change change rapidly with new business requirements.
+Apache Hudi is often used in combination with kafka as a Event Stream where all events are transmitted according to an Record Schema. In our case a Confluent Schema Registry is used to maintain the Schemas and the different Version every Schemas can evolve.
+
+# What do we want to achieve?
+We have multiple instances of Deltastreamer running consuming many topics with different schemas.
+Ideally every Topic should be able to evolve the schema to match new business requirements. Consumers start producing data with a new schema version and the Deltastreamer picks up the new schema and ingests the data with the new schema. For this to work we run our Deltastreamer instances with the latest schema version available from the Schema Registry to ensure that we always use the freshest schema with all attributes.
+A prerequisites it that all the mentioned Schema evolutions must be `BACKWARD_TRANSITIVE` compatible (see [Schema Evolution and Compatibility of Avro Schema changes](https://docs.confluent.io/platform/current/schema-registry/avro.html). This ensures that every record in the kafka topic can always be read using the latest schema.
+
+
+# What is the problem?
+The normal operation looks like this. Multiple (or a single) producers write records to the kafka topic.
+![Normal operation](/assets/images/blog/hudi-schema-evolution/normal_operation.png)
+Things get complicated when a producer switches to a new Writer-Schema v2 (in this case `Producer A`). `Producer B` remains on Schema v1. E.g. a attribute `myattribute` was added to the schema, resulting in schema version v2.
+So Deltastreamer must not only be able to handle Events that suddenly have a new Schema but also parallel operation of different Schema versions.
+![Schema evolution](/assets/images/blog/hudi-schema-evolution/schema_evolution.png)
+The default deserializer used by Hudi `io.confluent.kafka.serializers.KafkaAvroDeserializer` uses the schema that that exact record was written with for deserialization. This causes Hudi to get records with multiple different schema from the kafka client. E.g. Event #13 has the new attribute `myattribute`, Event #14 dont has the new attribute `myattribute`. This makes things complicated and error-prone for Hudi.
+
+![Confluent Deserializer](/assets/images/blog/hudi-schema-evolution/confluent_deserializer.png)
+
+# Solution
+We can use a custom Deserializer `KafkaAvroSchemaDeserializer` and plug it into the kafka client.

Review comment:
       minor. 
   ```
   Hudi added a new custom Deserializer `KafkaAvroSchemaDeserializer` to solve this problem of different producers producing records in different schema versions, but to use the latest schema from schema registry to deserialize all the records" .
   ```

##########
File path: website/blog/2021-08-16-schema-evolution.md
##########
@@ -0,0 +1,35 @@
+---
+title: "The schema evolution story of Hudi"
+excerpt: "Evolve schema to keep data up to date with business"
+author: sbernauer
+category: blog
+---
+
+The schema used for data exchange between services can change change rapidly with new business requirements.
+Apache Hudi is often used in combination with kafka as a Event Stream where all events are transmitted according to an Record Schema. In our case a Confluent Schema Registry is used to maintain the Schemas and the different Version every Schemas can evolve.
+
+# What do we want to achieve?
+We have multiple instances of Deltastreamer running consuming many topics with different schemas.
+Ideally every Topic should be able to evolve the schema to match new business requirements. Consumers start producing data with a new schema version and the Deltastreamer picks up the new schema and ingests the data with the new schema. For this to work we run our Deltastreamer instances with the latest schema version available from the Schema Registry to ensure that we always use the freshest schema with all attributes.
+A prerequisites it that all the mentioned Schema evolutions must be `BACKWARD_TRANSITIVE` compatible (see [Schema Evolution and Compatibility of Avro Schema changes](https://docs.confluent.io/platform/current/schema-registry/avro.html). This ensures that every record in the kafka topic can always be read using the latest schema.
+
+
+# What is the problem?
+The normal operation looks like this. Multiple (or a single) producers write records to the kafka topic.
+![Normal operation](/assets/images/blog/hudi-schema-evolution/normal_operation.png)
+Things get complicated when a producer switches to a new Writer-Schema v2 (in this case `Producer A`). `Producer B` remains on Schema v1. E.g. a attribute `myattribute` was added to the schema, resulting in schema version v2.
+So Deltastreamer must not only be able to handle Events that suddenly have a new Schema but also parallel operation of different Schema versions.
+![Schema evolution](/assets/images/blog/hudi-schema-evolution/schema_evolution.png)
+The default deserializer used by Hudi `io.confluent.kafka.serializers.KafkaAvroDeserializer` uses the schema that that exact record was written with for deserialization. This causes Hudi to get records with multiple different schema from the kafka client. E.g. Event #13 has the new attribute `myattribute`, Event #14 dont has the new attribute `myattribute`. This makes things complicated and error-prone for Hudi.
+
+![Confluent Deserializer](/assets/images/blog/hudi-schema-evolution/confluent_deserializer.png)
+
+# Solution
+We can use a custom Deserializer `KafkaAvroSchemaDeserializer` and plug it into the kafka client.
+As first step the Deserializer gets the source schema from the Hudi SchemaProvider. The SchemaProvider can get the schema for example from a Confluent Schema-Registry or a file.
+The Deserializer then reads the records from the topic with the schema the record was written. As next step it will convert all the records to the source schema from the SchemaProvider, in our case the latest schema. As a result, the kafka client will return all records with a unified schema. Hudi does not need to handle different schemas inside a single batch.
+![KafkaAvroSchemaDeserializer](/assets/images/blog/hudi-schema-evolution/KafkaAvroSchemaDeserializer.png)
+
+# How to use this solution
+As of the coming release 0.9.0 the KafkaAvroSchemaDeserializer is not turned on by default, instead the normal Confluent Deseializer is used.
+You must set `hoodie.deltastreamer.source.kafka.value.deserializer.class=org.apache.hudi.utilities.deser.KafkaAvroSchemaDeserializer` to use the solution described in this blog post.

Review comment:
       In the last diagram, in hudi, its written as "Residing data, multiple event V1". Guess it should be V2. 
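
   For reference, the deserializer property quoted in the hunk above would typically be set in the DeltaStreamer properties file. A minimal hedged fragment (only the first key/value below appears in this thread; the registry key and URL are illustrative placeholders, not taken from the diff):
   ```properties
   # Use Hudi's deserializer so every record is deserialized to the latest schema
   hoodie.deltastreamer.source.kafka.value.deserializer.class=org.apache.hudi.utilities.deser.KafkaAvroSchemaDeserializer
   # Illustrative placeholder: schema provider backed by a Confluent Schema Registry
   hoodie.deltastreamer.schemaprovider.registry.url=http://localhost:8081/subjects/mytopic-value/versions/latest
   ```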

##########
File path: website/blog/2021-08-16-schema-evolution.md
##########
@@ -0,0 +1,35 @@
+---
+title: "The schema evolution story of Hudi"
+excerpt: "Evolve schema to keep data up to date with business"
+author: sbernauer
+category: blog
+---
+
+The schema used for data exchange between services can change change rapidly 
with new business requirements.
+Apache Hudi is often used in combination with kafka as a Event Stream where 
all events are transmitted according to an Record Schema. In our case a 
Confluent Schema Registry is used to maintain the Schemas and the different 
Version every Schemas can evolve.
+
+# What do we want to achieve?
+We have multiple instances of Deltastreamer running consuming many topics with 
different schemas.
+Ideally every Topic should be able to evolve the schema to match new business 
requirements. Consumers start producing data with a new schema version and the 
Deltastreamer picks up the new schema and ingests the data with the new schema. 
For this to work we run our Deltastreamer instances with the latest schema 
version available from the Schema Registry to ensure that we always use the 
freshest schema with all attributes.
+A prerequisite is that all the mentioned schema evolutions must be 
`BACKWARD_TRANSITIVE` compatible (see [Schema Evolution and Compatibility of 
Avro Schema 
changes](https://docs.confluent.io/platform/current/schema-registry/avro.html)). 
This ensures that every record in the kafka topic can always be read using the 
latest schema.
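For illustration, a backward-compatible Avro change means every field added in a newer version must carry a default, so that a reader using the new schema can still read records written with the old one. A minimal made-up example of such a v2 schema:

```json
{
  "type": "record",
  "name": "MyEvent",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "myattribute", "type": ["null", "string"], "default": null}
  ]
}
```

Dropping `"default": null` here would break `BACKWARD_TRANSITIVE` compatibility, and the Schema Registry would reject the new version.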
+
+
+# What is the problem?
+The normal operation looks like this. Multiple (or a single) producers write 
records to the kafka topic.
+![Normal 
operation](/assets/images/blog/hudi-schema-evolution/normal_operation.png)
+Things get complicated when a producer switches to a new Writer-Schema v2 (in 
this case `Producer A`). `Producer B` remains on Schema v1. E.g. an attribute 
`myattribute` was added to the schema, resulting in schema version v2.

Review comment:
       again, let's start with a line break.
   ```
   <br/> Things get complicated ...
   ```

##########
File path: website/blog/2021-08-16-schema-evolution.md
##########
@@ -0,0 +1,35 @@
+---
+title: "The schema evolution story of Hudi"
+excerpt: "Evolve schema to keep data up to date with business"
+author: sbernauer
+category: blog
+---
+
+The schema used for data exchange between services can change rapidly 
with new business requirements.
+Apache Hudi is often used in combination with kafka as an Event Stream where 
all events are transmitted according to a Record Schema. In our case a 
Confluent Schema Registry is used to maintain the Schemas and the different 
Version every Schemas can evolve.

Review comment:
       minor. ".... maintain the schema and as schema evolves, newer versions 
are updated in the schema registry".  

##########
File path: website/blog/2021-08-16-schema-evolution.md
##########
@@ -0,0 +1,35 @@
+---
+title: "The schema evolution story of Hudi"
+excerpt: "Evolve schema to keep data up to date with business"
+author: sbernauer
+category: blog
+---
+
+The schema used for data exchange between services can change rapidly 
with new business requirements.
+Apache Hudi is often used in combination with kafka as an Event Stream where 
all events are transmitted according to a Record Schema. In our case a 
Confluent Schema Registry is used to maintain the Schemas and the different 
Version every Schemas can evolve.
+
+# What do we want to achieve?
+We have multiple instances of Deltastreamer running consuming many topics with 
different schemas.
+Ideally every Topic should be able to evolve the schema to match new business 
requirements. Consumers start producing data with a new schema version and the 
Deltastreamer picks up the new schema and ingests the data with the new schema. 
For this to work we run our Deltastreamer instances with the latest schema 
version available from the Schema Registry to ensure that we always use the 
freshest schema with all attributes.
+A prerequisite is that all the mentioned schema evolutions must be 
`BACKWARD_TRANSITIVE` compatible (see [Schema Evolution and Compatibility of 
Avro Schema 
changes](https://docs.confluent.io/platform/current/schema-registry/avro.html)). 
This ensures that every record in the kafka topic can always be read using the 
latest schema.
+
+
+# What is the problem?
+The normal operation looks like this. Multiple (or a single) producers write 
records to the kafka topic.
+![Normal 
operation](/assets/images/blog/hudi-schema-evolution/normal_operation.png)
+Things get complicated when a producer switches to a new Writer-Schema v2 (in 
this case `Producer A`). `Producer B` remains on Schema v1. E.g. an attribute 
`myattribute` was added to the schema, resulting in schema version v2.
+So Deltastreamer must not only be able to handle Events that suddenly have a 
new Schema but also parallel operation of different Schema versions.

Review comment:
       Let's reword this a bit.
   ```
   Deltastreamer is capable of handling such schema evolution, if all incoming 
records were evolved and serialized with the evolved schema. But the 
complication is that some records are serialized with schema version V1 and 
some are serialized with schema version V2. 
   ```

##########
File path: website/blog/2021-08-16-schema-evolution.md
##########
@@ -0,0 +1,35 @@
+---
+title: "The schema evolution story of Hudi"
+excerpt: "Evolve schema to keep data up to date with business"
+author: sbernauer
+category: blog
+---
+
+The schema used for data exchange between services can change rapidly 
with new business requirements.
+Apache Hudi is often used in combination with kafka as an Event Stream where 
all events are transmitted according to a Record Schema. In our case a 
Confluent Schema Registry is used to maintain the Schemas and the different 
Version every Schemas can evolve.
+
+# What do we want to achieve?
+We have multiple instances of Deltastreamer running consuming many topics with 
different schemas.
+Ideally every Topic should be able to evolve the schema to match new business 
requirements. Consumers start producing data with a new schema version and the 
Deltastreamer picks up the new schema and ingests the data with the new schema. 
For this to work we run our Deltastreamer instances with the latest schema 
version available from the Schema Registry to ensure that we always use the 
freshest schema with all attributes.
+A prerequisite is that all the mentioned schema evolutions must be 
`BACKWARD_TRANSITIVE` compatible (see [Schema Evolution and Compatibility of 
Avro Schema 
changes](https://docs.confluent.io/platform/current/schema-registry/avro.html)). 
This ensures that every record in the kafka topic can always be read using the 
latest schema.
+
+
+# What is the problem?
+The normal operation looks like this. Multiple (or a single) producers write 
records to the kafka topic.
+![Normal 
operation](/assets/images/blog/hudi-schema-evolution/normal_operation.png)
+Things get complicated when a producer switches to a new Writer-Schema v2 (in 
this case `Producer A`). `Producer B` remains on Schema v1. E.g. an attribute 
`myattribute` was added to the schema, resulting in schema version v2.
+So Deltastreamer must not only be able to handle Events that suddenly have a 
new Schema but also parallel operation of different Schema versions.

Review comment:
       let's add a line break at the end, before the image.

##########
File path: website/blog/2021-08-16-schema-evolution.md
##########
@@ -0,0 +1,35 @@
+---
+title: "The schema evolution story of Hudi"
+excerpt: "Evolve schema to keep data up to date with business"
+author: sbernauer
+category: blog
+---
+
+The schema used for data exchange between services can change rapidly 
with new business requirements.
+Apache Hudi is often used in combination with kafka as an Event Stream where 
all events are transmitted according to a Record Schema. In our case a 
Confluent Schema Registry is used to maintain the Schemas and the different 
Version every Schemas can evolve.
+
+# What do we want to achieve?
+We have multiple instances of Deltastreamer running consuming many topics with 
different schemas.
+Ideally every Topic should be able to evolve the schema to match new business 
requirements. Consumers start producing data with a new schema version and the 
Deltastreamer picks up the new schema and ingests the data with the new schema. 
For this to work we run our Deltastreamer instances with the latest schema 
version available from the Schema Registry to ensure that we always use the 
freshest schema with all attributes.

Review comment:
       minor (comma) : "For this to work, we run our Deltastreamer..."

##########
File path: website/blog/2021-08-16-schema-evolution.md
##########
@@ -0,0 +1,35 @@
+---
+title: "The schema evolution story of Hudi"
+excerpt: "Evolve schema to keep data up to date with business"
+author: sbernauer
+category: blog
+---
+
+The schema used for data exchange between services can change rapidly 
with new business requirements.
+Apache Hudi is often used in combination with kafka as an Event Stream where 
all events are transmitted according to a Record Schema. In our case a 
Confluent Schema Registry is used to maintain the Schemas and the different 
Version every Schemas can evolve.
+
+# What do we want to achieve?
+We have multiple instances of Deltastreamer running consuming many topics with 
different schemas.
+Ideally every Topic should be able to evolve the schema to match new business 
requirements. Consumers start producing data with a new schema version and the 
Deltastreamer picks up the new schema and ingests the data with the new schema. 
For this to work we run our Deltastreamer instances with the latest schema 
version available from the Schema Registry to ensure that we always use the 
freshest schema with all attributes.
+A prerequisite is that all the mentioned schema evolutions must be 
`BACKWARD_TRANSITIVE` compatible (see [Schema Evolution and Compatibility of 
Avro Schema 
changes](https://docs.confluent.io/platform/current/schema-registry/avro.html)). 
This ensures that every record in the kafka topic can always be read using the 
latest schema.
+
+
+# What is the problem?
+The normal operation looks like this. Multiple (or a single) producers write 
records to the kafka topic.
+![Normal 
operation](/assets/images/blog/hudi-schema-evolution/normal_operation.png)
+Things get complicated when a producer switches to a new Writer-Schema v2 (in 
this case `Producer A`). `Producer B` remains on Schema v1. E.g. an attribute 
`myattribute` was added to the schema, resulting in schema version v2.
+So Deltastreamer must not only be able to handle Events that suddenly have a 
new Schema but also parallel operation of different Schema versions.
+![Schema 
evolution](/assets/images/blog/hudi-schema-evolution/schema_evolution.png)
+The default deserializer used by Hudi, 
`io.confluent.kafka.serializers.KafkaAvroDeserializer`, uses the exact schema 
each record was written with for deserialization. This causes Hudi to get 
records with multiple different schemas from the kafka client. E.g. Event #13 
has the new attribute `myattribute`, while Event #14 does not have it. This 
makes things complicated and error-prone for Hudi.
+
+![Confluent 
Deserializer](/assets/images/blog/hudi-schema-evolution/confluent_deserializer.png)
+
+# Solution
+We can use a custom Deserializer `KafkaAvroSchemaDeserializer` and plug it 
into the kafka client.
+As a first step, the Deserializer gets the source schema from the Hudi 
SchemaProvider. The SchemaProvider can get the schema, for example, from a 
Confluent Schema-Registry or a file.
+The Deserializer then reads the records from the topic with the schema the 
record was written with. As the next step, it will convert all the records to 
the source schema from the SchemaProvider, in our case the latest schema. As a 
result, the kafka client will return all records with a unified schema. Hudi 
does not need to handle different schemas inside a single batch.
+![KafkaAvroSchemaDeserializer](/assets/images/blog/hudi-schema-evolution/KafkaAvroSchemaDeserializer.png)
+
+# How to use this solution
+As of the coming release 0.9.0, the KafkaAvroSchemaDeserializer is not turned 
on by default; instead, the normal Confluent Deserializer is used.

Review comment:
       ```
   As of the upcoming release 0.9.0, the normal Confluent Deserializer is used 
by default. One has to explicitly set KafkaAvroSchemaDeserializer as below, in 
order to ensure smooth schema evolution with different producers producing 
records in different versions. 
   
   
`hoodie.deltastreamer.source.kafka.value.deserializer.class=org.apache.hudi.utilities.deser.KafkaAvroSchemaDeserializer`
   ```
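In a DeltaStreamer properties file this could look roughly as below; the schema-registry URL and topic name are made-up examples:

```properties
# Use the schema-aware deserializer so all records in a batch share one schema
hoodie.deltastreamer.source.kafka.value.deserializer.class=org.apache.hudi.utilities.deser.KafkaAvroSchemaDeserializer
# SchemaProvider pointing at the latest source schema in the schema registry
hoodie.deltastreamer.schemaprovider.registry.url=http://localhost:8081/subjects/my-topic-value/versions/latest
```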




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

