Armelabdelkbir opened a new issue, #6496:
URL: https://github.com/apache/hudi/issues/6496

   
   
   Hi everyone, I'm trying to test schema evolution for my CDC pipeline
(Debezium + Kafka) with Hudi 0.11.0 and Spark Structured Streaming, following
this documentation: https://hudi.apache.org/docs/0.11.0/schema_evolution.
   Does Hudi handle schema evolution correctly, and is a job restart required?
After the restart, all the old values come back as null: only the values from
the latest commits are taken into account, so the data no longer matches my
Postgres source. On my schema registry I can see both V1 and V2 of the schema.

   Any ideas? Thanks.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Stop the Hudi streams and drop the Hive tables.
   2. Add a column on the Postgres source: `ALTER TABLE <table> ADD COLUMN
<column_name> character varying(50) DEFAULT 'toto';`
   3. Restart the Hudi Spark jobs.
   4. `SELECT *` from the Hudi `_ro` / `_rt` tables (or read the Hudi Parquet
files with Spark).
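   For context, a minimal sketch of the write options my restarted job uses,
assuming the 0.11.x schema-evolution configs (`hoodie.schema.on.read.enable`,
`hoodie.datasource.write.reconcile.schema`); the table name, key fields, and
save path are placeholders, not my real config:

   ```python
   # Sketch of Hudi writer options relevant to schema evolution in 0.11.x.
   # Table name, record key, and precombine field are placeholders.
   hudi_options = {
       "hoodie.table.name": "evolution",                 # placeholder
       "hoodie.datasource.write.recordkey.field": "id",  # placeholder
       "hoodie.datasource.write.precombine.field": "ts", # placeholder
       # Schema-on-read flag from the 0.11.0 schema-evolution docs, so old
       # file groups are read with the evolved schema:
       "hoodie.schema.on.read.enable": "true",
       # Reconcile the incoming batch schema with the table schema on write:
       "hoodie.datasource.write.reconcile.schema": "true",
   }

   # Applied on the streaming writer, e.g.:
   # df.writeStream.format("hudi").options(**hudi_options).start(path)
   ```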
   
   **Expected behavior**
   
   When I select the data, I expect to see the default value in the added
column, not null values.
   
   Data in the Postgres source:
   ```
   cdc_hudi=> select test, test2, test3 from hudipart
   ;
    test | test2 | test3 
   ------+-------+-------
    toto | f     | Toto
    test | t     | Toto
    test | t     | Toto
    test | t     | Toto
    toto | f     | Toto
    toto | f     | Toto
    toto | f     | Toto
    toto | f     | Toto
    toto | f     | Toto
    toto | f     | Toto
    toto | f     | Toto
    test | t     | Toto
    test | t     | Toto
    test | t     | test3
    test | t     | test3
    toto | f     | Toto
    toto | f     | Toto
    toto | f     | Toto
    test | t     | test3
    toto | f     | Toto
    toto | f     | Toto
    toto | f     | Toto
   ```
   Data in the Hudi Parquet files / Hive tables:
   ```
   spark.sql("select _hoodie_commit_time as commitTime, test, test2, test3 from 
evolution ").show()
   +-----------------+----+-----+-----+
   |       commitTime|test|test2|test3|
   +-----------------+----+-----+-----+
   |20220824102514494|null| null| null|
   |20220824102514494|null| null| null|
   |20220824102514494|null| null| null|
   |20220824102514494|null| null| null|
   |20220824102514494|null| null| null|
   |20220824102514494|null| null| null|
   |20220824102514494|null| null| null|
   |20220824102514494|null| null| null|
   |20220824102514494|null| null| null|
   |20220824102514494|null| null| null|
   |20220824102514494|null| null| null|
   |20220824102514494|null| null| null|
   |20220824132039517|null| null| null|
   |20220824132113066|null| null| null|
   |20220824132113066|null| null| null|
   |20220824132934016|test| true| null|
   |20220824135050368|test| true| null|
   |20220824135411903|test| true| null|
   |20220824135446080|test| true| null|
   |20220824135921176|test| true|test3|
   ```
   
   
   
   **Environment Description**
   
   * Hudi version : 0.11.0

   * Spark version : 3.1.4

   * Hive version : 1.2.1000

   * Hadoop version : 2.7.3

   * Storage : HDFS
   * Schema registry
   * Kafka
   * Debezium
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
