jasondavindev opened a new issue, #5469:
URL: https://github.com/apache/hudi/issues/5469
**Describe the problem you faced**
I'm writing an application that upserts records into a table. The problem is
that after an upsert, the ordering column of records that exist in the base
table but do not exist in the incoming data is overwritten with an invalid value.
E.g.
The base table has a record with `id = 1` and `createddate = 2022-04-01`
The incoming data has a record with `id = 2` and `createddate = 2022-04-02`
After the upsert operation, the `createddate` of the record with `id = 1` is
changed to `1970-xx-xx`, while the record with `id = 2` remains intact.
**To Reproduce**
```python
from pyspark.sql.functions import expr
from pyspark.sql import DataFrame, SparkSession

database = 'db'
table = 'tb'
table_path = f'/{database}/{table}'

spark = SparkSession.builder.config(
    'spark.sql.shuffle.partitions', '4').enableHiveSupport().getOrCreate()

options = {
    'hoodie.datasource.write.keygenerator.class':
        'org.apache.hudi.keygen.CustomKeyGenerator',
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.datasource.write.partitionpath.field': 'field:simple',
    'hoodie.datasource.write.precombine.field': 'createddate',
    'hoodie.payload.event.time.field': 'createddate',
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
    'hoodie.table.name': table,
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.datasource.hive_sync.partition_extractor_class':
        'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.mode': 'hms',
    'hoodie.datasource.hive_sync.support_timestamp': 'true',
    'hoodie.datasource.hive_sync.database': database,
    'hoodie.datasource.hive_sync.table': table,
    'hoodie.datasource.hive_sync.partition_fields': 'field',
}

full = spark.read.parquet('/opt/spark/conf/full/')
delta = spark.read.json('/opt/spark/conf/delta')

# Keep the first 19 characters of createddate and cast the result to a timestamp.
full_parse: DataFrame = full \
    .withColumn('createddate', expr('cast(substr(createddate, 1, 19) as timestamp)'))
delta_parse: DataFrame = delta \
    .withColumn('createddate', expr('cast(substr(createddate, 1, 19) as timestamp)'))

# Initial load of the base table.
full_parse \
    .write \
    .format('org.apache.hudi') \
    .options(**options) \
    .option('hoodie.datasource.write.operation', 'bulk_insert') \
    .mode('overwrite') \
    .save(table_path)

# Upsert of the incoming data into the existing table.
delta_parse \
    .write \
    .format('org.apache.hudi') \
    .options(**options) \
    .option('hoodie.datasource.write.operation', 'upsert') \
    .mode('append') \
    .save(table_path)
```
Example full file content
```
+------------------+-------------------+-----------+-------------------+------------------+---------+----------------+--------+------------------+
|createdbyid       |createddate        |datatype   |field              |id                |isdeleted|newvalue        |oldvalue|parentid          |
+------------------+-------------------+-----------+-------------------+------------------+---------+----------------+--------+------------------+
|0055G00000808dFQAQ|2022-03-16 16:55:13|DynamicEnum|Status_do_Imovel__c|0175G0000jIvmN7QAJ|false    |Visita Cancelada|null    |a015G00000kpbM3QAI|
+------------------+-------------------+-----------+-------------------+------------------+---------+----------------+--------+------------------+
```
After upsert operation
```
+------------------+-----------------------+-----------+-------------------+------------------+---------+----------------+----------------------+------------------+
|createdbyid       |createddate            |datatype   |field              |id                |isdeleted|newvalue        |oldvalue              |parentid          |
+------------------+-----------------------+-----------+-------------------+------------------+---------+----------------+----------------------+------------------+
|0055G00000808dFQAQ|1970-01-20 01:37:29.713|DynamicEnum|Status_do_Imovel__c|0175G0000jIvmN7QAJ|false    |Visita Cancelada|null                  |a015G00000kpbM3QAI|
+------------------+-----------------------+-----------+-------------------+------------------+---------+----------------+----------------------+------------------+
```
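A quick arithmetic check on the two `createddate` values shown above may help narrow this down. The sketch below assumes both displayed timestamps are in UTC (an assumption, since the session time zone is not stated here). Under that assumption, reading the original value's epoch seconds as if they were milliseconds gives exactly the corrupted value, which suggests a time-unit mismatch somewhere in the upsert path. It does not show where that happens.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Value in the base table before the upsert (assumed to be UTC).
original = datetime(2022, 3, 16, 16, 55, 13, tzinfo=timezone.utc)
epoch_seconds = int((original - EPOCH).total_seconds())

# Interpret the seconds count as milliseconds, i.e. a value 1000x too small.
misread = EPOCH + timedelta(milliseconds=epoch_seconds)

print(epoch_seconds)                                   # 1647449713
print(misread.strftime('%Y-%m-%d %H:%M:%S.%f')[:-3])   # 1970-01-20 01:37:29.713
```

The same factor of 1000 would also appear if milliseconds were read as microseconds, so this only points at the kind of mismatch, not at the exact pair of units involved.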
**Environment Description**
* Hudi version : 0.10.0
* Spark version : 3.1.2
* Storage (HDFS/S3/GCS..) : Local
* Running on Docker? (yes/no) : Yes