[I] [SUPPORT] Spark snapshot query against MOR table data written by Flink gives an incorrect timestamp [hudi]

via GitHub Mon, 18 Mar 2024 09:29:39 -0700


dderjugin opened a new issue, #10879:
URL: https://github.com/apache/hudi/issues/10879


   I'm writing data with the Flink DataStream API to a MOR table that is
partitioned by a column of type Long and uses a Timestamp field as the PK.
   A Spark snapshot query returns an incorrect value for the latest records:
+21971-04-23 04:46:37 instead of 1990-01-01 07:51:09.997.
   A read-optimized query returns the correct result.
   
   **To Reproduce**
   
   1. Write data to a MOR table using the Flink 1.17.2 DataStream API. The PK
is a Timestamp field and the partitioning field is a Long.
   2. Query the table data with a Spark snapshot query: "select count(*),
min(ts), max(ts) from [table]"
   3. The max value of the "ts" column is incorrect: +21971-04-23 04:46:37
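
Step 2 can be run for both query types from PySpark roughly as follows. This is only a sketch: `base_path` stands for a hypothetical table location, the helper names are illustrative, and the session is assumed to have the Hudi Spark bundle on its classpath. The read option `hoodie.datasource.query.type` and its values `snapshot` and `read_optimized` are the Hudi Spark datasource settings for the two query types compared in this report.

```python
# The two Hudi query types compared in this report.
QUERY_TYPES = ("read_optimized", "snapshot")


def hudi_read_options(query_type):
    """Build the Hudi datasource read options for one query type."""
    if query_type not in QUERY_TYPES:
        raise ValueError(f"unsupported query type: {query_type!r}")
    return {"hoodie.datasource.query.type": query_type}


def summarize_ts(spark, base_path, query_type):
    """Return a one-row DataFrame with count(*), min(ts) and max(ts)."""
    df = (
        spark.read.format("hudi")
        .options(**hudi_read_options(query_type))
        .load(base_path)  # base_path: hypothetical location of the MOR table
    )
    return df.selectExpr("count(*)", "min(ts)", "max(ts)")


def compare_query_types(spark, base_path):
    """Print the same aggregate for the read-optimized and snapshot views."""
    for query_type in QUERY_TYPES:
        print(query_type)
        summarize_ts(spark, base_path, query_type).show(truncate=False)
```

Calling `compare_query_types(spark, base_path)` with an existing `SparkSession` prints the two result tables one after the other, which makes a difference in `max(ts)` easy to spot.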
   
   **Expected behavior**
   Max "ts" column value should be "1990-01-01 07:51:09.997"
   
   Examples:
   - correct result using read optimized query:
   +--------+-------------------+-----------------------+
   |count(1)|min(ts)            |max(ts)                |
   +--------+-------------------+-----------------------+
   |5166373 |1989-12-31 23:59:50|1990-01-01 05:47:39.998|
   +--------+-------------------+-----------------------+
   
   - incorrect result using snapshot query:
   +--------+-------------------+---------------------+
   |count(1)|min(ts)            |max(ts)              |
   +--------+-------------------+---------------------+
   |6033156 |1989-12-31 23:59:50|+21971-02-25 05:42:13|
   +--------+-------------------+---------------------+
   
   A detailed query shows that only the PK column value is incorrect; the rest
of the table columns have the expected values.
   The data in the corresponding Parquet file looks correct for all columns,
including the PK column.
   
   **Environment Description**
   
   * Hudi version : 0.14.1
   
   * Spark version : 3.4.2
   
   * Hive version : 2.3.9
   
   * Hadoop version : 3.3.6
   
   * Storage (HDFS/S3/GCS..) : local filesystem
   
   * Running on Docker? (yes/no) : no
   
   


