Frogglet opened a new issue #3965:
URL: https://github.com/apache/hudi/issues/3965


   **Describe the problem you faced**
   
   This issue seems distinct from 
https://github.com/apache/hudi/issues/3429 because it involves loss of 
millisecond precision, not just microsecond precision.
   
   We are using the Spark-hive integration on EMR. Initially we are running a 
bulk insert to get the data in place. Among the columns are a few timestamp 
columns with millisecond precision. After the bulk insert all is well. Then we 
run an upsert on the latest partition. After performing the upsert, the 
resulting data has some issues. In the rows that were updated/inserted, all is 
still well, but rows that were not touched by the upsert now have their 
timestamp columns truncated all the way down to seconds instead of 
milliseconds. We are setting the `CLEANER_FILE_VERSIONS_RETAINED_PROP` setting 
to `1`, to ensure that only the most recent file is in place, so it appears 
whenever the old data from the old file is brought over, these timestamp 
columns are being truncated for some reason.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Bulk insert data with millisecond timestamp columns into a CoW table.
   2. Run an upsert that doesn't touch every row in the partition.
   3. Observe that rows that weren't touched by the upsert now have their 
timestamps truncated to whole seconds instead of milliseconds.
   
   **Expected behavior**
   
   The millisecond precision of the timestamp columns should be retained.
   
   **Environment Description**
   
   * Hudi version : 0.8.0-amzn-0
   
   * Spark version : 3.1.2
   
   * Hive version : 3.1.2
   
   * Hadoop version : 3.2.1
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   Some possibly relevant general settings:
   
   ```
         DataSourceWriteOptions.TABLE_TYPE_OPT_KEY -> "COPY_ON_WRITE",
         HoodieCompactionConfig.CLEANER_FILE_VERSIONS_RETAINED_PROP -> "1",
         HoodieCompactionConfig.CLEANER_POLICY_PROP -> HoodieCleaningPolicy.KEEP_LATEST_FILE_VERSIONS.name()
   ```
   
   Settings specific to the initial bulk insert:
   
   ```
           .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL)
           .option(DataSourceWriteOptions.ENABLE_ROW_WRITER_OPT_KEY, "true")
           .option(HoodieWriteConfig.BULKINSERT_SORT_MODE, BulkInsertSortMode.PARTITION_SORT.name())
           .mode(SaveMode.Overwrite)
           .save(outputLocation)
   ```
   
   
   Settings specific to the upsert:
   
   ```
           .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
           .mode(SaveMode.Append)
           .save(outputLocation)
   ```
   
   
   `enableHiveSupport` is being called on the Spark session builder.
   
   `'spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED'` is being set 
in configuration of the EMR cluster to avoid Spark 3.0 errors on timestamps 
from before 1900.
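
   For reference, a sketch of how that flag is applied and how we observe the truncation (assuming a session built in code rather than via EMR cluster configuration; the app name and the `event_ts` column are hypothetical):

   ```scala
   import org.apache.spark.sql.SparkSession

   val spark = SparkSession.builder()
     .appName("hudi-timestamp-check") // hypothetical app name
     .config("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
     .enableHiveSupport()
     .getOrCreate()

   // Inspect the fractional seconds after the upsert: rows untouched by the
   // upsert show ".000" while updated rows keep their milliseconds.
   spark.read.format("hudi").load(outputLocation)
     .selectExpr("date_format(event_ts, 'yyyy-MM-dd HH:mm:ss.SSS') AS ts_ms")
     .show(false)
   ```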
   
   Any help on how to avoid this issue would be greatly appreciated.

