TarunMootala commented on issue #4914:
URL: https://github.com/apache/hudi/issues/4914#issuecomment-1055586449
Thanks for the response. It works when the fields are added at the end. However, I have a use case where I need to populate default values for missing fields at any position, not only at the end. Is there an option to do so?
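To illustrate, here is a minimal sketch of the behaviour I am after (the `status` field and its `UNKNOWN` default are hypothetical; today I have to backfill the column on the incoming DataFrame myself before writing):
```
from pyspark.sql import functions as F

# Target schema inserts a hypothetical `status` field in the middle:
# id, name, status, creation_date, last_update_time.
# Instead of Hudi filling a default for records that lack the field,
# I currently have to add it manually on the writer side:
incomingDF = inputDF.withColumn('status', F.lit('UNKNOWN')) \
    .select('id', 'name', 'status', 'creation_date', 'last_update_time')
```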
I also have another observation: the new fields that are added at the end are not visible when querying through Spark SQL. Below is the PySpark code I used in a notebook.
```
# Added the jar /usr/share/aws/aws-java-sdk/aws-java-sdk-bundle-1.12.31.jar
# to spark.jars, as recommended by the AWS EMR team, to resolve the version
# conflict in JsonUnmarshallerContext inside hudi-spark-bundle.jar.
table_name = "test_hudi_table7"
table_path = f"s3://<bucket_name>/Hudi/{table_name}"
inputDF = spark.createDataFrame(
    [
        ("100", "AAA", "2015-01-01", "2015-01-01T13:51:39.340396Z"),
        ("101", "BBB", "2015-01-01", "2015-01-01T12:14:58.597216Z"),
        ("102", "CCC", "2015-01-01", "2015-01-01T13:51:40.417052Z"),
        ("103", "DDD", "2015-01-01", "2015-01-01T13:51:40.519832Z"),
        ("104", "EEE", "2015-01-01", "2015-01-01T12:15:00.512679Z"),
        ("105", "FFF", "2015-01-01", "2015-01-01T13:51:42.248818Z")
    ],
    ["id", "name", "creation_date", "last_update_time"]
)
hudiOptions = {
    'hoodie.table.name': table_name,
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.datasource.write.precombine.field': 'last_update_time',
    'hoodie.datasource.write.reconcile.schema': 'true',
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.database': 'streaming_dev',
    'hoodie.datasource.hive_sync.partition_extractor_class':
        'org.apache.hudi.hive.NonPartitionedExtractor'
}
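# Initial insert: creates the table with the base four-field schema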
inputDF.write \
    .format('hudi') \
    .option('hoodie.datasource.write.operation', 'insert') \
    .options(**hudiOptions) \
    .mode('overwrite') \
    .save(table_path)
# Added 2 new fields at the end
inputDF = spark.createDataFrame(
    [
        ("106", "AAA", "2015-01-01", "2015-01-01T13:51:39.340396Z", "2015-01-01", "2015-01-01"),
        ("107", "BBB", "2015-01-01", "2015-01-01T12:14:58.597216Z", "2015-01-01", "2015-01-01"),
        ("108", "CCC", "2015-01-01", "2015-01-01T13:51:40.417052Z", "2015-01-01", "2015-01-01"),
        ("109", "DDD", "2015-01-01", "2015-01-01T13:51:40.519832Z", "2015-01-01", "2015-01-01"),
        ("110", "EEE", "2015-01-01", "2015-01-01T12:15:00.512679Z", "2015-01-01", "2015-01-01"),
        ("111", "FFF", "2015-01-01", "2015-01-01T13:51:42.248818Z", "2015-01-01", "2015-01-01")
    ],
    ["id", "name", "creation_date", "last_update_time", "creation_date1", "creation_date2"]
)
hudiOptions = {
    'hoodie.table.name': table_name,
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.datasource.write.precombine.field': 'last_update_time',
    'hoodie.datasource.write.reconcile.schema': 'true',
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.database': 'streaming_dev',
    'hoodie.datasource.hive_sync.partition_extractor_class':
        'org.apache.hudi.hive.NonPartitionedExtractor'
}
print(table_name, table_path)
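# Upsert the new batch; with reconcile.schema enabled, the two trailing
# fields should be merged into the table schema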
inputDF.write \
    .format('hudi') \
    .option('hoodie.datasource.write.operation', 'upsert') \
    .options(**hudiOptions) \
    .mode('append') \
    .save(table_path)
spark.read.format('hudi').load(table_path).show()  # can see the new fields added
spark.sql('select * from <table name>').show()     # can't see the new fields
spark.table('<table name>').show()                 # can't see the new fields
```
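In case it helps narrow this down, here is a sketch of how I would inspect the Hive-synced schema from Spark SQL (the `REFRESH TABLE` is only to rule out stale catalog metadata; `<table name>` is the same placeholder as above):
```
# Hypothetical diagnostic: refresh any cached metadata, then compare the
# catalog schema against the fields visible through the datasource read.
spark.sql('REFRESH TABLE streaming_dev.<table name>')
spark.sql('DESCRIBE TABLE streaming_dev.<table name>').show(truncate=False)
```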
Note: the same example works on EMR 6.4 (Hudi 0.8).