Ryan Pifer created HUDI-2971:
--------------------------------
Summary: Timestamp values being corrupted when using BULK INSERT with row writing enabled
Key: HUDI-2971
URL: https://issues.apache.org/jira/browse/HUDI-2971
Project: Apache Hudi
Issue Type: Bug
Reporter: Ryan Pifer
We found that after bulk inserting data that includes Timestamp columns and
then performing other write operations on the table, the Timestamp values of
the records from the initial load were all corrupted. We narrowed this down to
cases where row writing is enabled, which uses Spark Datasource V2. In Hudi
0.9.0, row writing is enabled by default.
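For reference, the two write options this report revolves around can be sketched as a plain options map. This is a minimal illustration, not the reporter's actual job: the class and method names are hypothetical, and only the two option keys (`hoodie.datasource.write.operation` and `hoodie.datasource.write.row.writer.enable`) are Hudi datasource settings. Table name, key fields, and paths are left out on purpose because the report does not give them.

{code:java}
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: collects only the options relevant to this report.
public class BulkInsertOptions {

    // Returns the write options for a bulk insert, with row writing
    // switched on or off explicitly instead of relying on the default.
    static Map<String, String> bulkInsertOptions(boolean rowWriterEnabled) {
        Map<String, String> opts = new LinkedHashMap<>();
        opts.put("hoodie.datasource.write.operation", "bulk_insert");
        opts.put("hoodie.datasource.write.row.writer.enable",
                 String.valueOf(rowWriterEnabled));
        return opts;
    }

    public static void main(String[] args) {
        // With Spark available, a map like this would be passed along with the
        // usual table options, e.g. df.write().format("hudi").options(opts)...
        bulkInsertOptions(false).forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
{code}

Passing `false` corresponds to the configuration in which we did not observe the corruption; whether it is an acceptable workaround for a given workload is a separate question.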
Performing 2 inserts on a new table: `ts_ts` matches `ts_string` in both
records (expected result)
{code:java}
scala> spark.read.format("hudi").load("s3://ryanpife-emr-dev/hudi/data/hudi090/timestamp/2/").show()
+-------------------+--------------------+------------------+----------------------+--------------------+---+-------+---------+-------------------+-------------------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|version|partition|          ts_string|              ts_ts|
+-------------------+--------------------+------------------+----------------------+--------------------+---+-------+---------+-------------------+-------------------+
|     20211022233434|  20211022233434_0_1|               101|                  2019|0db6c29d-5291-4f7...|101|      1|     2019|2021-05-07 00:00:00|2021-05-07 00:00:00|
|     20211022233556|  20211022233556_0_1|               102|                  2019|0db6c29d-5291-4f7...|102|      2|     2019|2021-05-07 00:00:00|2021-05-07 00:00:00|
+-------------------+--------------------+------------------+----------------------+--------------------+---+-------+---------+-------------------+-------------------+
{code}
Performing a bulk insert followed by an insert: `ts_ts` does not match
`ts_string` in the bulk-inserted record (corrupted result)
{code:java}
+-------------------+--------------------+------------------+----------------------+--------------------+---+-------+---------+-------------------+--------------------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|version|partition|          ts_string|               ts_ts|
+-------------------+--------------------+------------------+----------------------+--------------------+---+-------+---------+-------------------+--------------------+
|     20211022232152|  20211022232152_0_1|               104|                  2019|dbdc2dd9-e870-4cf...|104|      4|     2019|2021-05-07 00:00:00|1970-01-19 18:05:...|
|     20211022232441|  20211022232441_0_1|               105|                  2019|dbdc2dd9-e870-4cf...|105|      5|     2019|2021-05-07 00:00:00| 2021-05-07 00:00:00|
+-------------------+--------------------+------------------+----------------------+--------------------+---+-------+---------+-------------------+--------------------+
{code}
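One observation on the corrupted value, offered as a hint and not as a confirmed root cause: `1970-01-19 18:05:...` is what you would get if the epoch-millisecond count for `2021-05-07 00:00:00` were read back as microseconds. The sketch below checks that arithmetic. It assumes the displayed timestamps are in UTC, and the class and method names are hypothetical.

{code:java}
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.temporal.ChronoUnit;

// Hypothetical check: does a millis-vs-micros mix-up reproduce the bad value?
public class TimestampUnitCheck {

    // Interprets a millisecond epoch count as if it were a microsecond count.
    static Instant readMillisAsMicros(long epochMillis) {
        return Instant.EPOCH.plus(epochMillis, ChronoUnit.MICROS);
    }

    public static void main(String[] args) {
        // Epoch milliseconds of the expected value, assuming UTC.
        long millis = LocalDateTime.parse("2021-05-07T00:00:00")
                .toInstant(ZoneOffset.UTC)
                .toEpochMilli();
        System.out.println(millis);                     // 1620345600000
        System.out.println(readMillisAsMicros(millis)); // 1970-01-19T18:05:45.600Z
    }
}
{code}

The result matches the visible part of the truncated `ts_ts` value for record 104, which suggests that a timestamp precision mismatch between the bulk-insert row-writer path and the later write path may be worth checking.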