[GitHub] [hudi] kazdy opened a new issue, #5873: [SUPPORT] Reconcile schema - missing field dropped from metadata

GitBox Wed, 15 Jun 2022 01:48:03 -0700


kazdy opened a new issue, #5873:
URL: https://github.com/apache/hudi/issues/5873


   **Describe the problem you faced**
   I'm using schema on read (the full schema evolution feature) together with 
the reconcile schema feature to evolve the schema of a Hudi COW table that is 
synchronized with the Glue Data Catalog.
   
   In one batch (upsert operation) I add a column (col_a) in the middle of the 
table.
   In the next batch (upsert) I add a new column (col_b) at the end of the 
table, but col_a is missing from the data frame.
   When I then query the table via Athena or Spark SQL, col_a has been dropped 
and is not visible.
   
   If I upsert a further batch with a data frame that contains both col_a and 
col_b, all data is visible in Spark and Athena.
   
   I would expect Hudi to handle this case during the schema reconciliation 
phase and preserve col_a with null values.
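
For reference, a minimal sketch of write options that would match the setup described above. The report does not list the options actually used, so the table name and the exact combination here are assumptions; the keys themselves are standard Hudi write configs.

```python
# Hypothetical Hudi write options for the scenario described in this issue.
# The table name is a placeholder; the actual options used were not reported.
hudi_options = {
    "hoodie.table.name": "example_table",                 # placeholder name
    "hoodie.datasource.write.operation": "upsert",        # batches 1 and 2 are upserts
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.schema.on.read.enable": "true",               # schema on read / full schema evolution
    "hoodie.datasource.write.reconcile.schema": "true",   # reconcile schema feature
}

# With a Spark DataFrame `df` and a base path `path`, these would typically be passed as:
# df.write.format("hudi").options(**hudi_options).mode("append").save(path)
```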
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   Operations, step by step
   | Batch seq | Operation | DF schema | Table Schema | Expected Table Schema |
   |-----------|-----------|-----------|--------------|-----------------------|
   | 0 | insert | col_1: string, col_2: string | col_1: string, col_2: string | col_1: string, col_2: string |
   | 1 | upsert | col_1: string, col_a: string, col_2: string | col_1: string, col_a: string, col_2: string | col_1: string, col_a: string, col_2: string |
   | 2 | upsert | col_1: string, col_2: string, col_b: string | col_1: string, col_2: string, col_b: string | col_1: string, col_a: string, col_2: string, col_b: string |
   
   **Expected behavior**
   
   After batch 2 the table should have the schema:
   col_1: string, col_a: string, col_2: string, col_b: string
   
   with col_a preserved and filled with null values in rows where the column 
is missing from the incoming data frame.
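
The expected behavior can be sketched in plain Python. This is only an illustration of the merge I would expect, not Hudi's actual reconciliation code; the function names `reconcile_columns` and `pad_row` are made up for this example.

```python
from typing import Dict, List, Optional


def reconcile_columns(table_cols: List[str], incoming_cols: List[str]) -> List[str]:
    """Merge incoming columns into the table schema without dropping any table column.

    A new column is placed right after the column that precedes it in the
    incoming schema, so mid-table additions keep their position.
    """
    merged = list(table_cols)
    for i, col in enumerate(incoming_cols):
        if col in merged:
            continue
        if i == 0:
            merged.insert(0, col)
        else:
            # The previous incoming column is always present in `merged` by now.
            anchor = merged.index(incoming_cols[i - 1])
            merged.insert(anchor + 1, col)
    return merged


def pad_row(row: Dict[str, Optional[str]], columns: List[str]) -> Dict[str, Optional[str]]:
    """Return the row in merged column order, with None for missing columns."""
    return {col: row.get(col) for col in columns}


# Batch 1: col_a is added in the middle of the table.
after_batch_1 = reconcile_columns(["col_1", "col_2"], ["col_1", "col_a", "col_2"])

# Batch 2: col_b is added at the end and col_a is absent from the data frame.
after_batch_2 = reconcile_columns(after_batch_1, ["col_1", "col_2", "col_b"])

# A batch 2 row gets a null col_a instead of the column being dropped.
padded = pad_row({"col_1": "x", "col_2": "y", "col_b": "z"}, after_batch_2)
```

Here `after_batch_2` keeps col_a between col_1 and col_2, which is the schema listed in the last column of the table above.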
   
   **Environment Description**
   
   * Hudi version : 0.11.0 OSS
   
   * Spark version : 3.2.0-amzn
   
   * Hive version : 3.2.1
   
   * Hadoop version : 3.2.1
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : yes/ emr on eks 6.6
   
   
   **Additional context**
   
   
   
   **Stacktrace**
   
   ```Add the stacktrace of the error.```
   
   


