MaxGekk commented on pull request #31552:
URL: https://github.com/apache/spark/pull/31552#issuecomment-778562133


   @dongjoon-hyun Thank you for your quick response. 
   
   > Actually, this seems to reduce an existing test coverage for all data 
sources ...
   
   Yep, it does. I should highlight that the "test all data types" test checks 
an end-to-end scenario, and if there are any conversions between the Julian and 
Gregorian calendars, the test fails on some seeds. That's why we forcibly set 
the rebasing mode to `CORRECTED` for Avro and Parquet in the test, see 
https://github.com/apache/spark/blob/ba13b94f6b2b477a93c0849c1fc776ffd5f1a0e6/sql/hive/src/test/scala/org/apache/spark/sql/sources/HadoopFsRelationTest.scala#L157-L159
   to avoid dates that don't exist in one of the calendars.
   
   At the same time, ORC is in fact tested in the "LEGACY" mode, where we 
perform datetime rebasing between calendars. So, if we enabled "LEGACY" for 
Avro or Parquet, they would fail as well.
   
   > Do you think we can narrow down to ORC only?
   
   We can exclude some date ranges like 1582-10-05 .. 1582-10-15 and 29 Feb in 
some leap years. In that case, we can test Avro/Parquet in the "LEGACY" mode 
too (and remove the SQL config settings shown above).
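
   To make the problematic dates concrete, here is a minimal sketch in plain 
Python (not Spark code; the helper names are hypothetical). It assumes the 
hybrid calendar switches from Julian to Gregorian on 1582-10-15 and only 
handles years >= 1:

```python
from datetime import date

# First Gregorian day of the hybrid (Julian + Gregorian) calendar (assumption).
CUTOVER = (1582, 10, 15)

def exists_in_proleptic_gregorian(year, month, day):
    """True if the date is valid when Gregorian rules apply to all years."""
    try:
        date(year, month, day)  # datetime.date is proleptic Gregorian
        return True
    except ValueError:
        return False

def exists_in_hybrid(year, month, day):
    """True if the date is valid in the hybrid calendar (years >= 1 only)."""
    if (year, month, day) >= CUTOVER:
        return exists_in_proleptic_gregorian(year, month, day)
    if (year, month, day) >= (1582, 10, 5):
        return False  # 1582-10-05 .. 1582-10-14 were skipped at the cutover
    # Julian rules: every 4th year is a leap year, with no century exception.
    feb = 29 if year % 4 == 0 else 28
    days_in_month = [31, feb, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
    return 1 <= month <= 12 and 1 <= day <= days_in_month[month - 1]

# Dates that are valid in only one of the two calendars.
for ymd in [(1582, 10, 10), (1500, 2, 29), (1600, 2, 29)]:
    print(ymd, exists_in_hybrid(*ymd), exists_in_proleptic_gregorian(*ymd))
```

   A random date generator that skips any date for which the two functions 
disagree would stay clear of both kinds of gaps.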
   
   
   For me, the case of ORC's dates (and timestamps too) seems similar to 
Parquet's INT96 timestamps. The ORC spec says nothing about the calendar 
system (https://orc.apache.org/specification/ORCv2/), and it just mentions the 
offset in days from the epoch:
   _"
   Date data is encoded with a PRESENT stream, a DATA stream that records **the 
number of days after January 1, 1970 in UTC**
   "_
   Since DATE is stored just as a number of days, the calendar system is not 
"hard coded" in the spec. I think we should support the **"CORRECTED"** mode 
(via a SQL config and/or a DS option) in the ORC datasource too, as we recently 
did for Parquet INT96 in https://github.com/apache/spark/pull/30056. 
@cloud-fan @bart-samwel WDYT? 
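
   As an illustration of why the calendar matters for a bare day count, here is 
a small sketch in plain Python (again not Spark code; the function names are 
hypothetical). It reads the same number of days since 1970-01-01 either as a 
proleptic Gregorian date or as a hybrid-calendar date, assuming the hybrid 
calendar switches on 1582-10-15 and using the usual integer algorithm for 
turning a Julian Day Number into a Julian calendar date:

```python
from datetime import date

EPOCH = date(1970, 1, 1)
JDN_OF_EPOCH = 2440588  # Julian Day Number of 1970-01-01
# Day count of the first Gregorian day in the hybrid calendar (assumption).
CUTOVER_DAYS = (date(1582, 10, 15) - EPOCH).days

def as_proleptic_gregorian(days):
    """Read a day count as a proleptic Gregorian (year, month, day)."""
    d = date.fromordinal(EPOCH.toordinal() + days)
    return (d.year, d.month, d.day)

def as_julian(days):
    """Read a day count as a Julian calendar (year, month, day)."""
    f = JDN_OF_EPOCH + days + 1401
    e = 4 * f + 3
    h = 5 * ((e % 1461) // 4) + 2
    day = (h % 153) // 5 + 1
    month = (h // 153 + 2) % 12 + 1
    year = e // 1461 - 4716 + (14 - month) // 12
    return (year, month, day)

def as_hybrid(days):
    """Julian before the cutover, Gregorian from the cutover on."""
    if days >= CUTOVER_DAYS:
        return as_proleptic_gregorian(days)
    return as_julian(days)

# The same stored value gets two different labels before the cutover.
days = CUTOVER_DAYS - 1
print(days, as_proleptic_gregorian(days), as_hybrid(days))
```

   Under these assumptions the stored integer is identical in both cases; only 
the reader's choice of calendar changes the date it is displayed as.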
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
