bart-samwel commented on pull request #31552:
URL: https://github.com/apache/spark/pull/31552#issuecomment-779141105


   > For me, the case of ORC's date (and timestamps too) seems similar to 
Parquet's INT96 timestamps. The ORC spec says nothing about the calendar 
systems (https://orc.apache.org/specification/ORCv2/), and it just mentions the 
offset in days from the epoch:
   > _" Date data is encoded with a PRESENT stream, a DATA stream that records 
**the number of days after January 1, 1970 in UTC** "_
   > Since DATE is just stored as a number of days, the calendar system is not "hard 
coded" in the spec. I think we should support the **"CORRECTED"** mode (via a 
SQL config and/or a DS option) in the ORC datasource too, as we did 
recently for Parquet INT96 in #30056. @cloud-fan @bart-samwel WDYT?
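
To illustrate the point in the quote that a bare day count does not pin down a calendar, here is a minimal Python sketch. It is not Spark or ORC code; the function names are made up for illustration, and the Julian conversion uses the standard Julian Day Number arithmetic (assumed valid only for positive day numbers).

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)
JDN_OF_EPOCH = 2440588  # Julian Day Number of 1970-01-01 (Gregorian)


def as_proleptic_gregorian(days: int) -> date:
    """Interpret a days-since-epoch value in the proleptic Gregorian calendar."""
    # Python's datetime.date is proleptic Gregorian by definition.
    return EPOCH + timedelta(days=days)


def as_julian(days: int) -> tuple:
    """Interpret the same value in the Julian calendar as (year, month, day).

    Illustrative helper; assumes the resulting Julian Day Number is positive.
    """
    j = JDN_OF_EPOCH + days
    f = j + 1401
    e = 4 * f + 3
    g = (e % 1461) // 4
    h = 5 * g + 2
    day = (h % 153) // 5 + 1
    month = (h // 153 + 2) % 12 + 1
    year = e // 1461 - 4716 + (14 - month) // 12
    return year, month, day


if __name__ == "__main__":
    stored = -141428  # one day before the 1582-10-15 Gregorian cutover
    print(as_proleptic_gregorian(stored))  # proleptic Gregorian label
    print(as_julian(stored))               # Julian label for the same day count
```

The same stored value, -141428, reads as 1582-10-14 in the proleptic Gregorian calendar and as 1582-10-04 in the Julian calendar, which is why a reader has to know which calendar the writer used.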
   
   That makes sense. At least then we can store the data that gets generated 
internally and read it back unchanged. Backward compatibility would take some 
work, just as it did for Parquet -- e.g. we'd have to add metadata to the ORC 
files, and if that metadata is absent, we'd need to detect which system wrote 
the file and base the read-side rebasing decision on that.
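
A rough sketch of that read-side decision, in Python for brevity rather than as actual Spark code. The metadata key names, the `choose_rebase_mode` function, and the version threshold are assumptions for illustration only; the one fact relied on is that Spark switched to the proleptic Gregorian calendar in 3.0.

```python
from typing import Mapping

# Hypothetical metadata keys, not the names of any real ORC or Spark keys.
WRITER_VERSION_KEY = "example.writer.version"
LEGACY_CALENDAR_KEY = "example.legacyDateTime"


def parse_major_minor(version: str) -> tuple:
    """Turn a version string like '2.4.7' into (2, 4)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def choose_rebase_mode(metadata: Mapping[str, str], fallback_mode: str) -> str:
    """Pick how to interpret stored day counts when reading a file.

    - No writer version in the metadata: the writer is unknown, so defer
      to the user-configured fallback mode.
    - Writer explicitly flagged legacy calendar output: rebase on read.
    - Writer older than 3.0: assume the hybrid Julian/Gregorian calendar.
    - Otherwise: values are already proleptic Gregorian.
    """
    version = metadata.get(WRITER_VERSION_KEY)
    if version is None:
        return fallback_mode
    if LEGACY_CALENDAR_KEY in metadata:
        return "LEGACY"
    if parse_major_minor(version) < (3, 0):
        return "LEGACY"
    return "CORRECTED"
```

The fallback branch is where the "detect which system wrote the file" work would go; here it simply defers to a configured default.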
   
   FWIW, I think the data generator's limitations should be tuned explicitly 
per test to match what that test is expected to handle. I.e., if we expect a 
test not to handle some kind of date correctly, *then and only then* do we turn 
generation of those dates off.




