jorgecarleitao commented on issue #1360:
URL: https://github.com/apache/arrow-datafusion/issues/1360#issuecomment-979430852


   FWIW, both Spark and pyarrow give the wrong result, each in a different way.
   
   ## pyarrow
   
   ```python
   import pyarrow.parquet
   
   path = "data-dimension-vehicle-20210609T222533Z-4cols-14rows.parquet"
   
   # read the whole file and print the int96 timestamp column
   table = pyarrow.parquet.read_table(path)
   print(table["dimension_load_date"])
   ```
   
   ```
   pyarrow.Field<dimension_load_date: timestamp[ns]>
   [
     [
       1816-03-29 05:56:08.066277376,
       1815-03-30 05:56:08.066277376,
       2021-06-09 00:02:37.000000000,
       ...
     ]
   ]
   ```
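   
   Those 1816 dates look like plain two's-complement wraparound: the nanosecond count for a year-9999 timestamp does not fit in 64 bits, so it wraps into the past. A minimal sketch of that hypothesis (my own arithmetic with a hypothetical `wrap_i64` helper, assuming the file stored 9999-12-31 00:00:00; this is not pyarrow's actual code):
   
   ```python
   import datetime
   
   def wrap_i64(n: int) -> int:
       # reinterpret an arbitrary integer as a two's-complement i64
       n &= (1 << 64) - 1
       return n - (1 << 64) if n >= (1 << 63) else n
   
   # nanoseconds since the Unix epoch for 9999-12-31, in exact integer arithmetic
   delta = datetime.datetime(9999, 12, 31) - datetime.datetime(1970, 1, 1)
   nanos = (delta.days * 86_400 + delta.seconds) * 1_000_000_000
   
   wrapped = wrap_i64(nanos)  # overflows i64 and comes out negative
   print(datetime.datetime(1970, 1, 1) + datetime.timedelta(microseconds=wrapped // 1_000))
   # 1816-03-29 05:56:08.066277 -- the first value above, to microsecond precision
   ```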
   
   ## spark
   
   While it provides the correct result in your case, Spark only reads up to microsecond precision (i.e. it truncates nanoseconds). See the [source code](https://github.com/apache/spark/blob/HEAD/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L184): the `nanos / NANOS_PER_MICROS` division drops the nanosecond remainder. So it accepts 9999-12-31 only because that date, expressed in microseconds, happens to still fit in an i64 (larger dates do not); a sketch of the conversion follows below.
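   
   A rough Python sketch of that conversion (mirroring the linked Scala; the constant values are the well-known ones, but treat the exact names and shape as my assumption, not Spark's actual code):
   
   ```python
   JULIAN_DAY_OF_EPOCH = 2_440_588  # Julian day number of 1970-01-01
   MICROS_PER_DAY = 86_400_000_000
   NANOS_PER_MICROS = 1_000
   
   def julian_day_to_micros(julian_day: int, nanos_of_day: int) -> int:
       # int96 stores (julian day, nanoseconds within that day); Spark folds it
       # into microseconds since the epoch, truncating the nanosecond remainder
       days = julian_day - JULIAN_DAY_OF_EPOCH
       return days * MICROS_PER_DAY + nanos_of_day // NANOS_PER_MICROS
   ```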
   
   I do not think there is a correct answer here: "9999-12-31" is not representable as an i64 in nanoseconds. Given that int96's original scope was to support nanoseconds, pyarrow seems to preserve that behavior; OTOH, to avoid crashing, it spits out _something_, even if that something is meaningless in this context.
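   
   For scale: i64 nanoseconds since the epoch run out in April 2262, while i64 microseconds reach far beyond year 9999. A quick check using only the standard library:
   
   ```python
   import datetime
   
   I64_MAX = 2**63 - 1
   epoch = datetime.datetime(1970, 1, 1)
   
   # the largest instant representable as i64 nanoseconds since the epoch
   print(epoch + datetime.timedelta(microseconds=I64_MAX // 1_000))
   # 2262-04-11 23:47:16.854775
   ```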
   
   Panicking is a bit too harsh, but at least it does not allow you to go back 
to the 19th century xD
   
   Note that int96 [has been deprecated](https://issues.apache.org/jira/browse/PARQUET-323).

