jorgecarleitao edited a comment on issue #1360:
URL: https://github.com/apache/arrow-datafusion/issues/1360#issuecomment-979430852
FWIW, both Spark and pyarrow give the wrong result, each in a different way.
## pyarrow
```python
import pyarrow.parquet

# parquet file from the original report; the column below is stored as int96
path = "data-dimension-vehicle-20210609T222533Z-4cols-14rows.parquet"
table = pyarrow.parquet.read_table(path)
print(table["dimension_load_date"])
```
```
pyarrow.Field<dimension_load_date: timestamp[ns]>
[
  [
    1816-03-29 05:56:08.066277376,
    1815-03-30 05:56:08.066277376,
    2021-06-09 00:02:37.000000000,
    ...
  ]
]
```
## spark
While Spark produces the correct result in your case, it only reads up to
microsecond precision (i.e. it truncates the nanoseconds). See the [source
code](https://github.com/apache/spark/blob/HEAD/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L184):
the `nanos / NANOS_PER_MICROS` division discards the nanosecond part. So Spark
handles 9999-12-31 only because that date, expressed in microseconds, still
happens to fit in an i64 (larger dates would not); a quick check is sketched
below.
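As a back-of-the-envelope check (plain Python, independent of Spark's actual code): 9999-12-31 as microseconds since the Unix epoch fits in an i64, while the same instant in nanoseconds does not.

```python
from datetime import datetime

# Seconds from the Unix epoch to 9999-12-31 (midnight).
seconds = (datetime(9999, 12, 31) - datetime(1970, 1, 1)).days * 86_400

i64_max = 2**63 - 1
print(seconds * 10**6 <= i64_max)  # True:  9999-12-31 fits as microseconds
print(seconds * 10**9 <= i64_max)  # False: it overflows as nanoseconds
```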
--------------------
I do not think there is a correct answer here: "9999-12-31" is not
representable as an i64 in nanoseconds. Given that int96's original scope was
to support nanoseconds, pyarrow seems to preserve that behavior; OTOH, to
avoid crashing, it spits out _something_, even if that something is
meaningless in this context.
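FWIW, the ~1816 dates in the pyarrow output above are consistent with a plain
two's-complement wraparound of the overflowing nanosecond count. A minimal
sketch of that arithmetic (my reconstruction, not pyarrow's actual code):

```python
from datetime import datetime, timedelta

epoch = datetime(1970, 1, 1)
ns = (datetime(9999, 12, 31) - epoch).days * 86_400 * 10**9

# Wrap the out-of-range value into i64, as an unchecked cast would.
wrapped = (ns + 2**63) % 2**64 - 2**63
print(wrapped)  # negative: the value wrapped around
print(epoch + timedelta(microseconds=wrapped // 1000))  # ~1816-03-29, as above
```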
Panicking is a bit too harsh, but at least it does not allow you to go back
to the 19th century xD
Note that int96 [has been
deprecated](https://issues.apache.org/jira/browse/PARQUET-323).