[GitHub] [hudi] jasondavindev opened a new issue, #5469: [SUPPORT] Upsert overwriting ordering field with invalid value

GitBox Fri, 29 Apr 2022 12:19:34 -0700


jasondavindev opened a new issue, #5469:
URL: https://github.com/apache/hudi/issues/5469


   **Describe the problem you faced**
   
   I'm writing an application that upserts records into a Hudi table. The problem is that after an upsert, the ordering column of records that exist in the base table but do not exist in the incoming data is overwritten with an invalid value.
   E.g.
   The base table has a record with `id = 1` and `createddate = 2022-04-01`
   The incoming data has a record with `id = 2` and `createddate = 2022-04-02`
   
   After the upsert operation, the `createddate` of the record with `id = 1` is changed to `1970-xx-xx`, while the record with `id = 2` remains intact.
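
To make the expected behaviour concrete, here is a minimal pure-Python sketch of the upsert semantics assumed in this report. It is a toy model, not Hudi's implementation; the helper name `expected_upsert` and the use of plain dicts are illustrative assumptions.

```python
from datetime import date


def expected_upsert(base, incoming, key="id", ordering="createddate"):
    """Toy model of upsert semantics (hypothetical helper, not Hudi code)."""
    merged = {row[key]: row for row in base}
    for row in incoming:
        current = merged.get(row[key])
        # Take the incoming row when the key is new, or when its ordering
        # value is at least as recent as the stored one.
        if current is None or row[ordering] >= current[ordering]:
            merged[row[key]] = row
    return sorted(merged.values(), key=lambda r: r[key])


base = [{"id": 1, "createddate": date(2022, 4, 1)}]
incoming = [{"id": 2, "createddate": date(2022, 4, 2)}]
result = expected_upsert(base, incoming)
print(result)
```

Under this model the record with `id = 1` is not touched at all, because no incoming record shares its key, so its `createddate` should still be `2022-04-01` after the upsert.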
   
   **To Reproduce**
   ```python
   from pyspark.sql.functions import expr
   from pyspark.sql import DataFrame, SparkSession
   
   database = 'db'
   table = 'tb'
   table_path = f'/{database}/{table}'
   
   spark = SparkSession.builder.config(
       'spark.sql.shuffle.partitions', '4').enableHiveSupport().getOrCreate()
   
   options = {
       'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.CustomKeyGenerator',
       'hoodie.datasource.write.recordkey.field': 'id',
       'hoodie.datasource.write.partitionpath.field': 'field:simple',
       'hoodie.datasource.write.precombine.field': 'createddate',
       'hoodie.payload.event.time.field': 'createddate',
       'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
       'hoodie.table.name': table,
   
       'hoodie.datasource.write.hive_style_partitioning': 'true',
       'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
       'hoodie.datasource.hive_sync.enable': 'true',
       'hoodie.datasource.hive_sync.mode': 'hms',
       'hoodie.datasource.hive_sync.support_timestamp': 'true',
       'hoodie.datasource.hive_sync.database': database,
       'hoodie.datasource.hive_sync.table': table,
       'hoodie.datasource.hive_sync.partition_fields': 'field',
   
   }
   
   full = spark.read.parquet(
       '/opt/spark/conf/full/')
   delta = spark.read.json(
       '/opt/spark/conf/delta')
   
   # Keep the first 19 characters (yyyy-MM-dd HH:mm:ss) and cast to timestamp
   full_parse: DataFrame = full \
       .withColumn('createddate', expr('cast(substr(createddate, 1, 19) as timestamp)'))
   
   delta_parse: DataFrame = delta \
       .withColumn('createddate', expr('cast(substr(createddate, 1, 19) as timestamp)'))
   
   full_parse \
       .write \
       .format('org.apache.hudi') \
       .options(**options) \
       .option('hoodie.datasource.write.operation', 'bulk_insert') \
       .mode('overwrite') \
       .save(table_path)
   
   delta_parse \
       .write \
       .format('org.apache.hudi') \
       .options(**options) \
       .option('hoodie.datasource.write.operation', 'upsert') \
       .mode('append') \
       .save(table_path)
   ```
   
   Example content of the `full` file (before the upsert)
   
   ```
   
   +------------------+-------------------+-----------+-------------------+------------------+---------+----------------+--------+------------------+
   |createdbyid       |createddate        |datatype   |field              |id                |isdeleted|newvalue        |oldvalue|parentid          |
   +------------------+-------------------+-----------+-------------------+------------------+---------+----------------+--------+------------------+
   |0055G00000808dFQAQ|2022-03-16 16:55:13|DynamicEnum|Status_do_Imovel__c|0175G0000jIvmN7QAJ|false    |Visita Cancelada|null    |a015G00000kpbM3QAI|
   +------------------+-------------------+-----------+-------------------+------------------+---------+----------------+--------+------------------+
   ```
   
   The same record after the upsert operation
   
   ```
   
   +------------------+-----------------------+-----------+-------------------+------------------+---------+----------------+----------------------+------------------+
   |createdbyid       |createddate            |datatype   |field              |id                |isdeleted|newvalue        |oldvalue              |parentid          |
   +------------------+-----------------------+-----------+-------------------+------------------+---------+----------------+----------------------+------------------+
   |0055G00000808dFQAQ|1970-01-20 01:37:29.713|DynamicEnum|Status_do_Imovel__c|0175G0000jIvmN7QAJ|false    |Visita Cancelada|null                  |a015G00000kpbM3QAI|
   +------------------+-----------------------+-----------+-------------------+------------------+---------+----------------+----------------------+------------------+
   ```
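
One observation that may help narrow this down: the two `createddate` values shown above are not unrelated. Assuming both are displayed in UTC (an assumption; the session time zone is not stated here), the corrupted value's offset from the Unix epoch is exactly 1/1000 of the original's. That would be consistent with the stored epoch value being read back in a unit 1000 times finer than the one it was written in (for example, milliseconds read as microseconds), though this is only a hypothesis and not a confirmed cause. A small sketch of the arithmetic:

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Values copied from the two tables above, assumed to be UTC.
original = datetime(2022, 3, 16, 16, 55, 13, tzinfo=timezone.utc)
corrupted = datetime(1970, 1, 20, 1, 37, 29, 713000, tzinfo=timezone.utc)

original_offset = original - EPOCH
corrupted_offset = corrupted - EPOCH

# timedelta arithmetic is exact at microsecond resolution.
scaled = original_offset / 1000
print("original offset :", original_offset)
print("corrupted offset:", corrupted_offset)
print("original / 1000 :", scaled)
print("matches corrupted value:", scaled == corrupted_offset)
```

If the comparison holds for other affected records as well, the timestamp unit used when existing rows are rewritten during the upsert would be the first thing to check.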
   
   **Environment Description**
   
   * Hudi version : 0.10.0
   
   * Spark version : 3.1.2
   
   * Storage (HDFS/S3/GCS..) : Local
   
   * Running on Docker? (yes/no) : Yes
   

