Ryan Pifer created HUDI-2971:
--------------------------------

             Summary: Timestamp values being corrupted when using BULK INSERT with row writing enabled
                 Key: HUDI-2971
                 URL: https://issues.apache.org/jira/browse/HUDI-2971
             Project: Apache Hudi
          Issue Type: Bug
            Reporter: Ryan Pifer


We found that after bulk inserting data that included Timestamp columns and 
then performing other write operations on the table, the Timestamp values of 
the records from the initial load were all corrupted. We narrowed this down to 
row writing being enabled, which uses Spark Datasource V2. In Hudi 0.9.0, row 
writing is enabled by default.
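
For anyone trying to reproduce this, the write options for the two cases can be sketched roughly as below. The option keys are standard Hudi datasource configs. The helper `hudiWriteOptions`, the table name, and the choice of `version` as the precombine field are assumptions made for illustration, not taken from the original job. Passing `rowWriterEnabled = false` corresponds to turning row writing off for the bulk insert.

```scala
// Sketch only: builds Hudi write options for the cases compared in this report.
// hudiWriteOptions is a hypothetical helper, not part of Hudi.
def hudiWriteOptions(operation: String, rowWriterEnabled: Boolean): Map[String, String] = Map(
  "hoodie.table.name" -> "timestamp_test",                  // hypothetical table name
  "hoodie.datasource.write.recordkey.field" -> "id",        // matches _hoodie_record_key in the output
  "hoodie.datasource.write.partitionpath.field" -> "partition",
  "hoodie.datasource.write.precombine.field" -> "version",  // assumed precombine field
  "hoodie.datasource.write.operation" -> operation,         // "bulk_insert" or "insert"
  "hoodie.datasource.write.row.writer.enable" -> rowWriterEnabled.toString
)

// Bulk insert with row writing on (the Hudi 0.9.0 default) and with it turned off.
val bulkInsertRowWriter = hudiWriteOptions("bulk_insert", rowWriterEnabled = true)
val bulkInsertNoRowWriter = hudiWriteOptions("bulk_insert", rowWriterEnabled = false)

// Usage in spark-shell (df and basePath are placeholders):
// df.write.format("hudi").options(bulkInsertRowWriter).mode("append").save(basePath)
```

Only the last two entries differ between the runs being compared; everything else stays the same.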

Performing 2 inserts on a new table: `ts_ts` matches `ts_string` in both 
records (expected results)
{code:java}
scala> spark.read.format("hudi").load("s3://ryanpife-emr-dev/hudi/data/hudi090/timestamp/2/").show()
+-------------------+--------------------+------------------+----------------------+--------------------+---+-------+---------+-------------------+-------------------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|version|partition|          ts_string|              ts_ts|
+-------------------+--------------------+------------------+----------------------+--------------------+---+-------+---------+-------------------+-------------------+
|     20211022233434|  20211022233434_0_1|               101|                  2019|0db6c29d-5291-4f7...|101|      1|     2019|2021-05-07 00:00:00|2021-05-07 00:00:00|
|     20211022233556|  20211022233556_0_1|               102|                  2019|0db6c29d-5291-4f7...|102|      2|     2019|2021-05-07 00:00:00|2021-05-07 00:00:00|
+-------------------+--------------------+------------------+----------------------+--------------------+---+-------+---------+-------------------+-------------------+
{code}
 

Performing a bulk insert, then an insert: `ts_ts` does not match `ts_string` in 
the record written by the bulk insert (corrupted result)
{code:java}
+-------------------+--------------------+------------------+----------------------+--------------------+---+-------+---------+-------------------+--------------------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|version|partition|          ts_string|               ts_ts|
+-------------------+--------------------+------------------+----------------------+--------------------+---+-------+---------+-------------------+--------------------+
|     20211022232152|  20211022232152_0_1|               104|                  2019|dbdc2dd9-e870-4cf...|104|      4|     2019|2021-05-07 00:00:00|1970-01-19 18:05:...|
|     20211022232441|  20211022232441_0_1|               105|                  2019|dbdc2dd9-e870-4cf...|105|      5|     2019|2021-05-07 00:00:00| 2021-05-07 00:00:00|
+-------------------+--------------------+------------------+----------------------+--------------------+---+-------+---------+-------------------+--------------------+
{code}
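
One observation about the corrupted value, offered as a hint and not as a confirmed root cause: `1970-01-19 18:05:...` is what you get if the epoch-millisecond count for `2021-05-07 00:00:00` is read as a microsecond count. The sketch below checks that arithmetic with `java.time` only. It assumes the displayed timestamps are in UTC, which the report does not state.

```scala
import java.time.{Instant, LocalDateTime, ZoneOffset}

// Expected value from the ts_string column, interpreted as UTC (assumption).
val expected: Instant =
  LocalDateTime.parse("2021-05-07T00:00:00").toInstant(ZoneOffset.UTC)

// Milliseconds since the epoch for the expected timestamp.
val epochMillis: Long = expected.toEpochMilli

// Read the same number as microseconds: 1 microsecond = 1000 nanoseconds.
val misreadAsMicros: Instant = Instant.EPOCH.plusNanos(epochMillis * 1000L)

println(s"expected:          $expected")
println(s"millis as micros:  $misreadAsMicros")
```

Under that assumption the misread value lands on 1970-01-19 at 18:05, the same date and minute as the corrupted `ts_ts` for record 104.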



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
