satyasinha-94 opened a new issue, #9151:
URL: https://github.com/apache/hudi/issues/9151

   **Describe the problem you faced**
   
   Deltastreamer throws an exception when trying to ingest from a ParquetDFSSource containing INT96 timestamps.
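   For background on why these timestamps are ambiguous: a parquet INT96 timestamp is 12 bytes, a little-endian 8-byte nanoseconds-of-day followed by a 4-byte Julian day number. Mapping that Julian day to a calendar date with the proleptic Gregorian calendar (what Spark 3.x's `CORRECTED` mode does) disagrees, for very old dates, with the hybrid Julian/Gregorian calendar used by Spark 2.x and legacy Hive writers. A minimal pure-Python decoder sketch (illustrative only, not Hudi/Spark code):

   ```python
   import struct
   from datetime import datetime, timedelta, timezone

   # Julian day number of the Unix epoch (1970-01-01)
   JULIAN_EPOCH_DAY = 2440588

   def decode_int96(raw: bytes) -> datetime:
       """Decode a 12-byte parquet INT96 timestamp: little-endian
       8-byte nanoseconds-of-day followed by a 4-byte Julian day."""
       nanos_of_day, julian_day = struct.unpack("<qi", raw)
       # datetime uses the proleptic Gregorian calendar, matching
       # Spark 3's CORRECTED behavior; LEGACY writers assumed the
       # hybrid calendar, hence the rebase warning for old dates.
       return datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(
           days=julian_day - JULIAN_EPOCH_DAY,
           microseconds=nanos_of_day // 1000,
       )
   ```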
   
   After adding `"spark.sql.legacy.parquet.int96RebaseModeInRead": "CORRECTED"` and `"spark.sql.legacy.avro.datetimeRebaseModeInWrite": "CORRECTED"` as configs, the ingest succeeds only when pointed at a source.dfs.root with a single parquet file; attempting to ingest multiple parquet files throws the exception in the stacktrace below.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Run Deltastreamer on a ParquetDFSSource with a single parquet file that has INT96 timestamps, with no legacy datetime rebase configs
   2. Observe the error
   3. Rerun with the legacy configs enabled; the ingest succeeds
   4. Run Deltastreamer again with the legacy configs, but with multiple parquet files with INT96 timestamps; observe the error again
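   The invocation presumably looks something like the following sketch (the bundle jar, table name, and S3 paths are placeholders; the `--source-class` and `--hoodie-conf` values are the standard ParquetDFSSource settings, not taken from the report):

   ```shell
   # Sketch of a Deltastreamer run reading INT96 parquet from S3.
   # Jar, table name, and paths are placeholders, not from this report.
   spark-submit \
     --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
     --conf spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED \
     --conf spark.sql.legacy.avro.datetimeRebaseModeInWrite=CORRECTED \
     hudi-utilities-bundle.jar \
     --table-type COPY_ON_WRITE \
     --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
     --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://bucket/int96-data/ \
     --target-base-path s3://bucket/hudi/int96_table \
     --target-table int96_table
   ```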
   
   **Expected behavior**
   
   Enabling the two legacy configs should allow ingestion of all parquet files under the source.dfs.root that contain INT96 timestamps.
   
   **Environment Description**
   
   Hudi version : 0.13
   
   Spark version : 3.1
   
   Hive version : N/A
   
   Hadoop version : N/A
   
   Storage (HDFS/S3/GCS..) : S3
   
   Running on Docker? (yes/no) : yes
   
   
   **Additional context**
   
   This issue was previously reported [here](https://github.com/apache/hudi/issues/6278)
   
   A comment describing the same failure when ingesting multiple parquet files with the legacy datetime rebase configs enabled is [here](https://github.com/apache/hudi/issues/6278#issuecomment-1356756712)
   
   **Stacktrace**
   
   ```
   org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet INT96 files can be ambiguous, as the files may be written by Spark 2.x or legacy versions of Hive, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can set spark.sql.legacy.parquet.int96RebaseModeInRead to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during reading. Or set spark.sql.legacy.parquet.int96RebaseModeInRead to 'CORRECTED' to read the datetime values as it is.
   ```

