Frogglet opened a new issue #3965: URL: https://github.com/apache/hudi/issues/3965
**Describe the problem you faced**

This issue seems distinct from https://github.com/apache/hudi/issues/3429 because it involves loss of millisecond precision, not just microsecond precision.

We are using the Spark-Hive integration on EMR. Initially we run a bulk insert to get the data in place; among the columns are a few timestamp columns with millisecond precision. After the bulk insert all is well. We then run an upsert on the latest partition. After the upsert, the rows that were updated or inserted are still fine, but rows that were *not* touched by the upsert now have their timestamp columns truncated all the way down to seconds instead of milliseconds.

We are setting `CLEANER_FILE_VERSIONS_RETAINED_PROP` to `1`, so that only the most recent file version is retained. It therefore appears that whenever untouched data is carried over from the old file, these timestamp columns are being truncated for some reason.

**To Reproduce**

Steps to reproduce the behavior:

1. Bulk insert data with millisecond-precision timestamp columns into a CoW table.
2. Run an upsert that doesn't touch every row in the partition.
3. Observe that rows that weren't touched by the upsert now have their timestamps truncated to the second instead of the millisecond.

**Expected behavior**

The timestamp columns should retain their millisecond precision.

**Environment Description**

* Hudi version : 0.8.0-amzn-0
* Spark version : 3.1.2
* Hive version : 3.1.2
* Hadoop version : 3.2.1
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no

**Additional context**

Some possibly relevant general settings:

```
DataSourceWriteOptions.TABLE_TYPE_OPT_KEY -> "COPY_ON_WRITE",
HoodieCompactionConfig.CLEANER_FILE_VERSIONS_RETAINED_PROP -> "1",
HoodieCompactionConfig.CLEANER_POLICY_PROP -> HoodieCleaningPolicy.KEEP_LATEST_FILE_VERSIONS.name()
```

Settings specific to the initial bulk insert:

```
.option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL)
.option(DataSourceWriteOptions.ENABLE_ROW_WRITER_OPT_KEY, "true")
.option(HoodieWriteConfig.BULKINSERT_SORT_MODE, BulkInsertSortMode.PARTITION_SORT.name())
.mode(SaveMode.Overwrite)
.save(outputLocation)
```

Settings specific to the upsert:

```
.option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
.mode(SaveMode.Append)
.save(outputLocation)
```

`enableHiveSupport` is being called on the Spark session builder. `spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED` is set in the configuration of the EMR cluster to avoid Spark 3.0 errors on timestamps from before 1900.

Any help on how to avoid this issue would be greatly appreciated.
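For reference, the option fragments quoted above can be combined into one minimal driver sketch of the reproduction. This is a configuration sketch under assumptions, not the original job: the DataFrames `df` and `updates`, the variable `outputLocation`, and any record-key/precombine settings (omitted here) are hypothetical placeholders; only the options shown in the snippets above come from the report.

```scala
// Minimal reproduction sketch (assumes a running SparkSession, a source
// DataFrame `df`, an update DataFrame `updates`, and an `outputLocation`
// path -- all hypothetical, plus the usual record key / precombine options
// that the original report does not show).
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.common.model.HoodieCleaningPolicy
import org.apache.hudi.config.{HoodieCompactionConfig, HoodieWriteConfig}
import org.apache.hudi.execution.bulkinsert.BulkInsertSortMode
import org.apache.spark.sql.SaveMode

val commonOpts = Map(
  DataSourceWriteOptions.TABLE_TYPE_OPT_KEY -> "COPY_ON_WRITE",
  HoodieCompactionConfig.CLEANER_FILE_VERSIONS_RETAINED_PROP -> "1",
  HoodieCompactionConfig.CLEANER_POLICY_PROP ->
    HoodieCleaningPolicy.KEEP_LATEST_FILE_VERSIONS.name()
)

// Step 1: bulk insert with the row writer enabled.
df.write.format("hudi")
  .options(commonOpts)
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY,
    DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.ENABLE_ROW_WRITER_OPT_KEY, "true")
  .option(HoodieWriteConfig.BULKINSERT_SORT_MODE, BulkInsertSortMode.PARTITION_SORT.name())
  .mode(SaveMode.Overwrite)
  .save(outputLocation)

// Step 2: upsert a subset of rows in the latest partition; untouched rows
// are rewritten by the copy-on-write merge, which is where the truncation shows up.
updates.write.format("hudi")
  .options(commonOpts)
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY,
    DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
  .mode(SaveMode.Append)
  .save(outputLocation)
```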
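To make step 3 of the reproduction concrete, one way to flag the affected rows after the upsert is to look for timestamps with an empty sub-second part. This is a diagnostic sketch only; the column name `event_ts` is a hypothetical stand-in for one of the affected timestamp columns, and rows whose timestamps legitimately end in `.000` will also match.

```scala
// Diagnostic sketch: count rows whose timestamp has no sub-second component.
// `event_ts` is a hypothetical column name, not from the original report.
import org.apache.spark.sql.functions.expr

val readBack = spark.read.format("hudi").load(outputLocation)

// Casting a timestamp to double yields epoch seconds with a fractional part;
// a remainder of 0 means the millisecond portion was lost (or was already zero).
val flagged = readBack.filter(expr("cast(event_ts as double) % 1 = 0"))
println(s"rows with second-only precision: ${flagged.count()}")
```

If the flagged count matches the number of rows *not* included in the upsert batch, that supports the suspicion that the truncation happens when untouched rows are rewritten.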
