tandonraghav opened a new issue #2919:
URL: https://github.com/apache/hudi/issues/2919
I am facing an issue with schema evolution. After adding a new field to the Spark
DF, the write throws an exception if there are existing log files/records that do not
have that field.
In the schema that Hudi writes to its log files, I can see that the union *type* of
*test* is reversed (`["string", "null"]` instead of `["null", "string"]`) and there is
no default value. Is this caused by the `SchemaConverters`?
**Environment Description**
* Hudi version : 0.8.0
* Spark version : 2.4
* Hive version :
* Hadoop version :
* Storage (HDFS/S3/GCS..) : HDFS
* Running on Docker? (yes/no) : no
Code
1. Create the DF
````
Dataset<Row> ds = AvroConversionUtils.createDataFrame(kafka.rdd(),
schemaStr, sparkSession);
Dataset<Row> insertedDs=ds.select("*").where(ds.col("op").notEqual("d"));
````
2. Persist this ds with the original schema first, and repeat the write a few times to
make sure some uncompacted log files exist.
3. Persist this ds again with the new schema; it throws the following error:
**Caused by: org.apache.avro.AvroTypeException: Found
hoodie.products.products_record, expecting hoodie.products.products_record,
missing required field test2**
Our schema is dynamic. I am not removing any field, only adding a field
at the end, and it still fails.
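For context, the message matches the Avro schema resolution rule: if the reader schema has a field that the writer schema lacks, and that reader field declares no default, resolution fails. The sketch below models only that rule with plain Java collections. It does not use the Avro or Hudi APIs, and the class and method names (`MissingFieldCheck`, `firstUnresolvableField`) are hypothetical.

````java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

public class MissingFieldCheck {

    /**
     * Simplified model of Avro schema resolution for records.
     * readerFieldHasDefault maps each reader field name to whether it declares a default.
     * Returns the first reader field that is absent from the writer and has no default.
     */
    static Optional<String> firstUnresolvableField(Set<String> writerFields,
                                                   Map<String, Boolean> readerFieldHasDefault) {
        for (Map.Entry<String, Boolean> field : readerFieldHasDefault.entrySet()) {
            boolean missingInWriter = !writerFields.contains(field.getKey());
            boolean noDefault = !field.getValue();
            if (missingInWriter && noDefault) {
                return Optional.of(field.getKey());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Fields present in the old log files (writer side), abbreviated.
        Set<String> writer = new HashSet<>(Arrays.asList("id", "op", "test"));

        // New schema as the reader sees it, assuming the defaults were dropped on conversion.
        Map<String, Boolean> reader = new LinkedHashMap<>();
        reader.put("id", false);
        reader.put("op", false);
        reader.put("test", false);
        reader.put("test2", false);

        System.out.println(firstUnresolvableField(writer, reader));

        // If test2 kept its default, the old records would resolve.
        reader.put("test2", true);
        System.out.println(firstUnresolvableField(writer, reader));
    }
}
````

Under this model, the old records fail only because `test2` reaches the reader without a default. Whether the default is actually lost during conversion in this setup is an assumption, not something confirmed here.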
Original Schema
````
{
"type": "record",
"name": "foo",
"namespace": "products",
"fields": [
{
"name": "id",
"type": [
"null",
"string"
],
"default": null
},
{
"name": "product_id",
"type": [
"null",
"string"
],
"default": null
},
{
"name": "db_name",
"type": [
"null",
"string"
],
"default": null
},
{
"name": "catalog_id",
"type": [
"null",
"string"
],
"default": null
},
{
"name": "feed_id",
"type": [
"null",
"string"
],
"default": null
},
{
"name": "ts_ms",
"type": [
"null",
"double"
],
"default": null
},
{
"name": "op",
"type": [
"null",
"string"
],
"default": null
},
{
"name": "test",
"type": [
"null",
"string"
],
"default": null
}
]
}
````
Changed Schema
````
{
"type": "record",
"name": "foo",
"namespace": "products",
"fields": [
{
"name": "id",
"type": [
"null",
"string"
],
"default": null
},
{
"name": "product_id",
"type": [
"null",
"string"
],
"default": null
},
{
"name": "db_name",
"type": [
"null",
"string"
],
"default": null
},
{
"name": "catalog_id",
"type": [
"null",
"string"
],
"default": null
},
{
"name": "feed_id",
"type": [
"null",
"string"
],
"default": null
},
{
"name": "ts_ms",
"type": [
"null",
"double"
],
"default": null
},
{
"name": "op",
"type": [
"null",
"string"
],
"default": null
},
{
"name": "test",
"type": [
"null",
"string"
],
"default": null
},
{
"name": "test2",
"type": [
"null",
"string"
],
"default": null
}
]
}
````
Added a column **test2** at the end with a default value of `null`.
Schema shown by Hoodie in Logs
````
{
"type": "record",
"name": "Max_IND_record",
"namespace": "hoodie.Max_IND",
"fields": [
{
"name": "_hoodie_commit_time",
"type": [
"null",
"string"
],
"doc": "",
"default": null
},
{
"name": "_hoodie_commit_seqno",
"type": [
"null",
"string"
],
"doc": "",
"default": null
},
{
"name": "_hoodie_record_key",
"type": [
"null",
"string"
],
"doc": "",
"default": null
},
{
"name": "_hoodie_partition_path",
"type": [
"null",
"string"
],
"doc": "",
"default": null
},
{
"name": "_hoodie_file_name",
"type": [
"null",
"string"
],
"doc": "",
"default": null
},
{
"name": "id",
"type": [
"string",
"null"
]
},
{
"name": "product_id",
"type": [
"string",
"null"
]
},
{
"name": "db_name",
"type": [
"string",
"null"
]
},
{
"name": "catalog_id",
"type": [
"string",
"null"
]
},
{
"name": "feed_id",
"type": [
"string",
"null"
]
},
{
"name": "ts_ms",
"type": [
"double",
"null"
]
},
{
"name": "op",
"type": [
"string",
"null"
]
},
{
"name": "test",
"type": [
"string",
"null"
]
}
]
}
````
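On the reversed union order: the Avro specification of that period (1.8.x) says that the default of a union field must correspond to the first branch of the union. A `null` default is therefore only expressible when `"null"` comes first, so a schema that lists `"string"` first cannot carry `"default": null`. The sketch below encodes just that rule. It is an illustration with a hypothetical helper name (`nullDefaultAllowed`), not code from Avro or Hudi.

````java
import java.util.Arrays;
import java.util.List;

public class UnionDefaultCheck {

    /**
     * Returns true when a JSON null default is legal for the given union,
     * i.e. when "null" is the first branch (rule as stated in the Avro 1.8.x spec).
     */
    static boolean nullDefaultAllowed(List<String> unionBranches) {
        return !unionBranches.isEmpty() && "null".equals(unionBranches.get(0));
    }

    public static void main(String[] args) {
        // Order used in the user-supplied schema.
        System.out.println(nullDefaultAllowed(Arrays.asList("null", "string")));
        // Order seen in the schema logged by Hudi.
        System.out.println(nullDefaultAllowed(Arrays.asList("string", "null")));
    }
}
````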