SubashRanganathan opened a new issue, #6305: URL: https://github.com/apache/hudi/issues/6305
Hudi DeltaStreamer is unable to read dates older than 1900-01-01. The workaround for this is to set the following Spark configurations:

```
--conf spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED
--conf spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED
--conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED
--conf spark.sql.legacy.parquet.datetimeRebaseModeInWrite=CORRECTED
```

These options work fine when I create a Hudi table with PySpark. However, when I run the CDC process with DeltaStreamer I still get the same error. Please note that I cannot use a Hudi transformer class, because for a transformer class to be applied, DeltaStreamer must first read the source files, and DeltaStreamer is not able to read them. The error message is:

```
An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob. You may get a different result due to the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet INT96 files can be ambiguous, as the files may be written by Spark 2.x or legacy versions of Hive, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can set spark.sql.legacy.parquet.int96RebaseModeInRead to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during reading. Or set spark.sql.legacy.parquet.int96RebaseModeInRead to 'CORRECTED' to read the datetime values as it is.
```
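For context, a minimal sketch of how these configs would be passed to a DeltaStreamer run via `spark-submit`: the jar path, source ordering field, target path, and table name below are placeholders, not the reporter's actual values, and the source class is assumed to be `ParquetDFSSource` since the error mentions Parquet INT96 files.

```shell
# Sketch only: replace the jar path, paths, and table options with your own.
spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --conf spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED \
  --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED \
  --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED \
  --conf spark.sql.legacy.parquet.datetimeRebaseModeInWrite=CORRECTED \
  /path/to/hudi-utilities-bundle.jar \
  --table-type COPY_ON_WRITE \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --source-ordering-field ts \
  --target-base-path /path/to/target/table \
  --target-table my_table
```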
