dderjugin opened a new issue, #10879: URL: https://github.com/apache/hudi/issues/10879
I'm writing data using Flink DataStream API to MoR table with partitioning by a column of Long type and PK as Timestamp field. Spark SNAPSHOT query gives an incorrect value of the latest records: +21971-04-23 04:46:37 instead of 1990-01-01 07:51:09.997. Read optimized query gives a correct result.

**To Reproduce**

1. Push data to MOR table using Flink 1.17.2 DataStream API. PK is timestamp, partitioning field is Long.
2. Query the table data using Spark snapshot query: "select count(*), min(ts), max(ts) from [table]"
3. The max value of "ts" column is incorrect: +21971-04-23 04:46:37

**Expected behavior**

Max "ts" column value should be "1990-01-01 07:51:09.997"

Examples:

- correct result using read optimized query:

+--------+-------------------+-----------------------+
|count(1)|min(ts)            |max(ts)                |
+--------+-------------------+-----------------------+
|5166373 |1989-12-31 23:59:50|1990-01-01 05:47:39.998|
+--------+-------------------+-----------------------+

- incorrect result using snapshot query:

+--------+-------------------+---------------------+
|count(1)|min(ts)            |max(ts)              |
+--------+-------------------+---------------------+
|6033156 |1989-12-31 23:59:50|+21971-02-25 05:42:13|
+--------+-------------------+---------------------+

Detailed query shows that only PK column value is incorrect and the rest of the table columns have expected values. Data in the corresponding parquet file looks correct for all columns including PK column.

**Environment Description**

* Hudi version : 0.14.1
* Spark version : 3.4.2
* Hive version : 2.3.9
* Hadoop version : 3.3.6
* Storage (HDFS/S3/GCS..) : local filesystem
* Running on Docker? (yes/no) : no
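A quick arithmetic check may help narrow this down. The sketch below tests one hypothesis that the report itself does not state: that the snapshot path reads a microsecond epoch value as if it were milliseconds. It assumes the timestamps are in UTC, and the helper names `civil_from_days` and `misread_micros_as_millis` are made up for illustration. Python's `datetime` cannot represent years above 9999, so the date is derived with plain integer arithmetic.

```python
from datetime import datetime, timezone


def civil_from_days(days_since_epoch: int) -> tuple[int, int, int]:
    """Convert days since 1970-01-01 to (year, month, day) in the
    proleptic Gregorian calendar, with no upper bound on the year."""
    z = days_since_epoch + 719468            # shift epoch to 0000-03-01
    era = z // 146097                        # 400-year cycles
    doe = z - era * 146097                   # day of era, 0..146096
    yoe = (doe - doe // 1460 + doe // 36524 - doe // 146096) // 365
    doy = doe - (365 * yoe + yoe // 4 - yoe // 100)
    mp = (5 * doy + 2) // 153                # month index, March = 0
    day = doy - (153 * mp + 2) // 5 + 1
    month = mp + 3 if mp < 10 else mp - 9
    year = yoe + era * 400 + (1 if month <= 2 else 0)
    return year, month, day


def misread_micros_as_millis(ts: datetime) -> str:
    """Return the date/time obtained when the epoch-microsecond value
    of `ts` is interpreted as epoch milliseconds (hypothetical bug)."""
    delta = ts - datetime(1970, 1, 1, tzinfo=timezone.utc)
    micros = (delta.days * 86400 + delta.seconds) * 1_000_000 + delta.microseconds
    seconds = micros // 1000                 # treat the micros count as millis
    days, rem = divmod(seconds, 86400)
    year, month, day = civil_from_days(days)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    return f"+{year}-{month:02d}-{day:02d} {hours:02d}:{minutes:02d}:{secs:02d}"


expected = datetime(1990, 1, 1, 7, 51, 9, 997000, tzinfo=timezone.utc)
print(misread_micros_as_millis(expected))    # +21971-04-23 04:46:37
```

Under these assumptions, the expected value 1990-01-01 07:51:09.997 maps exactly to the reported +21971-04-23 04:46:37, which is consistent with a millisecond/microsecond mismatch on the snapshot read path for this column. It does not show where such a mismatch would occur.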
