TarunMootala commented on issue #4914:
URL: https://github.com/apache/hudi/issues/4914#issuecomment-1055586449
Thanks for the response. It works when the fields are added at the end. However, I have a use case where I need to populate default values for missing fields at any position, not only at the end. Is there an option to do so?
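To illustrate, here is a minimal sketch of the behaviour I am after (the `status` field and its `UNKNOWN` default are hypothetical; today I have to backfill the column on the incoming DataFrame myself before writing):
```
from pyspark.sql import functions as F

# Target schema inserts a hypothetical `status` field in the middle:
# id, name, status, creation_date, last_update_time.
# Instead of Hudi filling a default for records that lack the field,
# I currently have to add it manually on the writer side:
incomingDF = inputDF.withColumn('status', F.lit('UNKNOWN')) \
    .select('id', 'name', 'status', 'creation_date', 'last_update_time')
```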
I also have another observation: the new fields that are added at the end are not visible when querying through Spark SQL. Below is the PySpark code I used in a notebook.
```
# Added the jar /usr/share/aws/aws-java-sdk/aws-java-sdk-bundle-1.12.31.jar
# to spark.jars, as recommended by the AWS EMR team, to resolve the version
# conflict in JsonUnmarshallerContext inside hudi-spark-bundle.jar.
table_name = "test_hudi_table7"
table_path = f"s3://<bucket_name>/Hudi/{table_name}"
inputDF = spark.createDataFrame(
    [
        ("100", "AAA", "2015-01-01", "2015-01-01T13:51:39.340396Z"),
        ("101", "BBB", "2015-01-01", "2015-01-01T12:14:58.597216Z"),
        ("102", "CCC", "2015-01-01", "2015-01-01T13:51:40.417052Z"),
        ("103", "DDD", "2015-01-01", "2015-01-01T13:51:40.519832Z"),
        ("104", "EEE", "2015-01-01", "2015-01-01T12:15:00.512679Z"),
        ("105", "FFF", "2015-01-01", "2015-01-01T13:51:42.248818Z")
    ],
    ["id", "name", "creation_date", "last_update_time"]
)
hudiOptions = {
    'hoodie.table.name': table_name,
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.datasource.write.precombine.field': 'last_update_time',
    'hoodie.datasource.write.reconcile.schema': 'true',
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.database': 'streaming_dev',
    'hoodie.datasource.hive_sync.partition_extractor_class':
        'org.apache.hudi.hive.NonPartitionedExtractor'
}
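# Initial insert: creates the table with the base four-field schema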
inputDF.write \
    .format('hudi') \
    .option('hoodie.datasource.write.operation', 'insert') \
    .options(**hudiOptions) \
    .mode('overwrite') \
    .save(table_path)
# Added 2 new fields at the end
inputDF = spark.createDataFrame(
    [
        ("106", "AAA", "2015-01-01", "2015-01-01T13:51:39.340396Z", "2015-01-01", "2015-01-01"),
        ("107", "BBB", "2015-01-01", "2015-01-01T12:14:58.597216Z", "2015-01-01", "2015-01-01"),
        ("108", "CCC", "2015-01-01", "2015-01-01T13:51:40.417052Z", "2015-01-01", "2015-01-01"),
        ("109", "DDD", "2015-01-01", "2015-01-01T13:51:40.519832Z", "2015-01-01", "2015-01-01"),
        ("110", "EEE", "2015-01-01", "2015-01-01T12:15:00.512679Z", "2015-01-01", "2015-01-01"),
        ("111", "FFF", "2015-01-01", "2015-01-01T13:51:42.248818Z", "2015-01-01", "2015-01-01")
    ],
    ["id", "name", "creation_date", "last_update_time", "creation_date1", "creation_date2"]
)
hudiOptions = {
    'hoodie.table.name': table_name,
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.datasource.write.precombine.field': 'last_update_time',
    'hoodie.datasource.write.reconcile.schema': 'true',
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.database': 'streaming_dev',
    'hoodie.datasource.hive_sync.partition_extractor_class':
        'org.apache.hudi.hive.NonPartitionedExtractor'
}
print(table_name, table_path)
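# Upsert the new batch; with reconcile.schema enabled, the two trailing
# fields should be merged into the table schema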
inputDF.write \
    .format('hudi') \
    .option('hoodie.datasource.write.operation', 'upsert') \
    .options(**hudiOptions) \
    .mode('append') \
    .save(table_path)
spark.read.format('hudi').load(table_path).show()  # can see the new fields added
spark.sql('select * from <table name>').show()     # can't see the new fields
spark.table('<table name>').show()                 # can't see the new fields
```
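In case it helps narrow this down, here is a sketch of how I would inspect the Hive-synced schema from Spark SQL (the `REFRESH TABLE` is only to rule out stale catalog metadata; `<table name>` is the same placeholder as above):
```
# Hypothetical diagnostic: refresh any cached metadata, then compare the
# catalog schema against the fields visible through the datasource read.
spark.sql('REFRESH TABLE streaming_dev.<table name>')
spark.sql('DESCRIBE TABLE streaming_dev.<table name>').show(truncate=False)
```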
Note: the same example works on EMR 6.4 (Hudi 0.8).