kazdy opened a new issue, #5873:
URL: https://github.com/apache/hudi/issues/5873
**Describe the problem you faced**
I'm using schema on read (the full schema evolution feature) together with the
reconcile schema feature to evolve a Hudi table schema. The table is a COW
table and is synchronized with the Glue Data Catalog.
I add a column (col_a) in the middle of the table in one batch (upsert
operation).
In the next batch (upsert) I add a new column at the end of the table (col_b),
but col_a is missing from the data frame.
When I then query the table via Athena or Spark SQL, col_a has been dropped
and is not visible.
If I upsert a further batch with a df that contains both col_a and col_b, all
data is visible again in Spark and Athena.
I would expect that during the schema reconciliation phase Hudi would handle
this case and preserve col_a with a null value.
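For context, a minimal sketch of writer options that would enable the two features named above. The two schema-related keys are the Hudi configs for schema on read and schema reconciliation; the table name, record key and precombine field are placeholders I made up for illustration, not the values from the actual job.

```python
# Hypothetical writer options for the upserts described in this report.
# Only the last two keys relate to the schema evolution features; the table
# name and key fields are placeholders, not the real job's values.
hudi_options = {
    "hoodie.table.name": "example_table",                 # placeholder
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.recordkey.field": "col_1",   # assumed key column
    "hoodie.datasource.write.precombine.field": "col_2",  # assumed
    "hoodie.schema.on.read.enable": "true",               # schema on read
    "hoodie.datasource.write.reconcile.schema": "true",   # reconcile schema
}
```

With PySpark these would typically be passed as `df.write.format("hudi").options(**hudi_options).mode("append").save(path)`, where `path` is the table location on S3.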
**To Reproduce**
Steps to reproduce the behavior:
Operations, step by step
| Batch seq | Operation | DF schema | Table Schema | Expected Table Schema |
|-----------|-----------|-----------|--------------|-----------------------|
| 0 | insert | col_1: string, col_2: string | col_1: string, col_2: string | col_1: string, col_2: string |
| 1 | upsert | col_1: string, col_a: string, col_2: string | col_1: string, col_a: string, col_2: string | col_1: string, col_a: string, col_2: string |
| 2 | upsert | col_1: string, col_2: string, col_b: string | col_1: string, col_2: string, col_b: string | col_1: string, col_a: string, col_2: string, col_b: string |
**Expected behavior**
In batch 2 table should have schema:
col_1: string, col_a: string, col_2: string, col_b: string
with col_a preserved with null values where column is missing
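To make the expectation concrete, here is a small plain-Python sketch of the merge I have in mind. It is not Hudi's implementation; `reconcile` and `pad_row` are hypothetical helpers that only model the desired outcome: columns already in the table keep their position, new columns are appended, and missing columns are filled with null.

```python
def reconcile(table_schema, df_schema):
    """Return the schema expected after reconciliation.

    Both arguments are lists of (name, type) pairs. Columns already in the
    table keep their position; columns only in the incoming frame are appended.
    """
    known = {name for name, _ in table_schema}
    merged = list(table_schema)
    merged += [(name, typ) for name, typ in df_schema if name not in known]
    return merged


def pad_row(row, merged_schema):
    """Fill columns absent from an incoming row with None (null)."""
    return {name: row.get(name) for name, _ in merged_schema}


# Table schema after batch 1 and DF schema of batch 2, as in the table above.
table = [("col_1", "string"), ("col_a", "string"), ("col_2", "string")]
batch_2 = [("col_1", "string"), ("col_2", "string"), ("col_b", "string")]

expected = reconcile(table, batch_2)
print([name for name, _ in expected])
# → ['col_1', 'col_a', 'col_2', 'col_b']
```

A batch 2 row such as `{"col_1": "x", "col_2": "y", "col_b": "z"}` passed through `pad_row(row, expected)` would come back with `col_a` set to `None`, which is the behavior I expected from the reconciliation phase.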
**Environment Description**
* Hudi version : 0.11.0 OSS
* Spark version : 3.2.0-amzn
* Hive version : 3.2.1
* Hadoop version : 3.2.1
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : yes/ emr on eks 6.6