MaxGekk commented on pull request #31552: URL: https://github.com/apache/spark/pull/31552#issuecomment-778562133
@dongjoon-hyun Thank you for your quick response.

> Actually, this seems to reduce an existing test coverage for all data sources ...

Yep, it does. I should highlight that the "test all data types" test checks an end-to-end scenario, and if there are any conversions between calendars (Julian <-> Gregorian), the test fails on some seeds. That's why we forcibly set the rebasing mode to `CORRECTED` for Avro and Parquet in the test (see https://github.com/apache/spark/blob/ba13b94f6b2b477a93c0849c1fc776ffd5f1a0e6/sql/hive/src/test/scala/org/apache/spark/sql/sources/HadoopFsRelationTest.scala#L157-L159), to avoid the dates that don't exist in one of the calendars. At the same time, ORC is in fact tested in the "LEGACY" mode, where we perform datetime rebasing between calendars. So, if we enabled "LEGACY" for Avro or Parquet, they would fail as well.

> Do you think we can narrow down to ORC only?

We can exclude some date ranges like 1582-10-05 .. 1582-10-15 + 29 Feb in some leap years. In that case, we can test Avro/Parquet in the "LEGACY" mode too (and remove the SQL config settings shown above).

For me, the case of ORC's dates (and timestamps too) seems similar to Parquet's INT96 timestamps. The ORC spec says nothing about calendar systems (https://orc.apache.org/specification/ORCv2/); it just mentions the offset in days from the epoch:

_"Date data is encoded with a PRESENT stream, a DATA stream that records **the number of days after January 1, 1970 in UTC**"_

Since DATE is stored as just a number of days, the calendar system is not "hard coded" in the spec. I think we should support the **"CORRECTED"** mode (via a SQL config and/or a DS option) in the ORC datasource too, as we did recently for Parquet INT96 in https://github.com/apache/spark/pull/30056.

@cloud-fan @bart-samwel WDYT?
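To make the "exclude some date ranges" idea concrete, here is a minimal Python sketch of a filter that keeps only dates whose year-month-day label is valid in both the proleptic Gregorian calendar and the hybrid Julian/Gregorian calendar. It is not Spark code: the function name `exists_in_both_calendars` is hypothetical, and the sketch assumes the hybrid calendar switches from Julian to Gregorian at 1582-10-15, so the skipped labels are 1582-10-05 through 1582-10-14.

```python
from datetime import date


def exists_in_both_calendars(year: int, month: int, day: int) -> bool:
    """Return True if the label year-month-day is valid in both calendars.

    Hypothetical helper for illustration only; it is not part of Spark.
    """
    # Python's datetime.date is proleptic Gregorian, so constructing the
    # date rejects Julian-only leap days such as 1500-02-29.
    try:
        date(year, month, day)
    except ValueError:
        return False
    # Assumed cutover: the hybrid calendar jumps from 1582-10-04 (Julian)
    # straight to 1582-10-15 (Gregorian), so these labels do not exist in it.
    if (year, month) == (1582, 10) and 5 <= day <= 14:
        return False
    return True


# Candidate (year, month, day) labels, e.g. from a random test-data generator.
candidates = [(1582, 10, 4), (1582, 10, 10), (1500, 2, 29), (1600, 2, 29)]
safe = [c for c in candidates if exists_in_both_calendars(*c)]
```

A generator filtered this way would only produce dates that can survive a round trip through rebasing in either direction, which is the property the "LEGACY" mode tests would need.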
