SubashRanganathan opened a new issue, #6305:
URL: https://github.com/apache/hudi/issues/6305

   Hudi DeltaStreamer is unable to read dates older than 1900-01-01. The 
workaround for this is to set the following Spark configurations:
   
   --conf spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED
   --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED
   --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED
   --conf spark.sql.legacy.parquet.datetimeRebaseModeInWrite=CORRECTED
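   For illustration, a sketch of how these configurations could be passed to 
DeltaStreamer through spark-submit. The jar path, source class, and 
target options below are placeholders, not taken from the issue:

   ```shell
   # Sketch only: paths and table options are hypothetical placeholders.
   spark-submit \
     --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
     --conf spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED \
     --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED \
     --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED \
     --conf spark.sql.legacy.parquet.datetimeRebaseModeInWrite=CORRECTED \
     /path/to/hudi-utilities-bundle.jar \
     --table-type COPY_ON_WRITE \
     --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
     --target-base-path /path/to/target \
     --target-table my_table
   ```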
   
   These options work fine when I create a Hudi table with PySpark. 
However, when I run the CDC process with DeltaStreamer, I still get this 
error. Please note that I cannot use the Hudi transformer class, because for 
the transformer class to be applied, DeltaStreamer must first read the source 
files, and DeltaStreamer is not able to read the source files.
   
   The error message is "An error occurred while calling 
z:org.apache.spark.api.python.PythonRDD.runJob.
   You may get a different result due to the upgrading of Spark 3.0: reading 
dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet 
INT96 files can be ambiguous, 
   as the files may be written by Spark 2.x or legacy versions of Hive, which 
uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic 
Gregorian calendar.
   See more details in SPARK-31404.
   You can set spark.sql.legacy.parquet.int96RebaseModeInRead to 'LEGACY' to 
rebase the datetime values w.r.t. the calendar difference during reading. 
   Or set spark.sql.legacy.parquet.int96RebaseModeInRead to 'CORRECTED' to read 
the datetime values as they are."
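
   For background (not from the issue itself): the ambiguity Spark warns about 
stems from the hybrid Julian/Gregorian calendar skipping ten days in October 
1582, whereas the proleptic Gregorian calendar used by Spark 3.x extends 
Gregorian rules backwards indefinitely. Python's standard-library `datetime` is 
also proleptic Gregorian, which gives a minimal sketch of the mismatch:

   ```python
   from datetime import date

   # Python's datetime, like Spark 3.x, uses the proleptic Gregorian calendar.
   # In the historical hybrid calendar, 1582-10-04 was immediately followed by
   # 1582-10-15 (one day apart); proleptically the difference is 11 days.
   gap = (date(1582, 10, 15) - date(1582, 10, 4)).days
   print(gap)  # prints 11
   ```

   The same underlying date can therefore map to different physical values 
depending on which calendar wrote it, which is why Spark asks the reader to 
choose a rebase mode explicitly.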


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
