andrei-ionescu edited a comment on issue #1360: URL: https://github.com/apache/arrow-datafusion/issues/1360#issuecomment-979463480
There are multiple things to discuss here. Even though `INT96` is deprecated, it has not yet been removed from Parquet and is still used by Spark, Flink, and many other frameworks. Spark 3 ships with the `spark.sql.parquet.outputTimestampType` option set to `INT96` by default ([see here](https://spark.apache.org/docs/3.0.0/configuration.html#runtime-sql-configuration)). As a result, there are lots of Parquet files out there with `INT96` columns whose values would fit into `INT64`, simply because that is the default setting. I would say the consistent behaviour is to support `INT96`, mark it as deprecated, and remove it if and when Parquet removes it.

Regarding the Spark implementation, here is a function that returns the nanos: https://github.com/apache/spark/blob/HEAD/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L192. The nanosecond precision is not discarded in Spark.

Apache Flink maps the `Timestamp` type to `INT96` too: https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/table/formats/parquet/. Impala also still uses it.

Can you point me to the part of the code where the overflow issue happens? I would like to understand it better.
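For context, here is a minimal sketch of my understanding of the decoding, in Rust since this is DataFusion. This is my own illustration, not DataFusion's or Spark's actual code: it assumes the usual `INT96` layout (8 little-endian bytes of nanoseconds-within-day followed by 4 little-endian bytes of Julian day) and shows where an `i64` nanosecond representation can overflow.

```rust
// Julian day number of the Unix epoch, 1970-01-01.
const JULIAN_DAY_OF_EPOCH: i64 = 2_440_588;
const SECONDS_PER_DAY: i64 = 86_400;
const NANOS_PER_SECOND: i64 = 1_000_000_000;

/// Decode a Parquet INT96 timestamp into nanoseconds since the Unix epoch.
///
/// Returns `None` when the result does not fit into an `i64` of
/// nanoseconds (roughly outside the years 1677..=2262), which is where an
/// unchecked implementation would silently overflow.
fn int96_to_nanos(bytes: [u8; 12]) -> Option<i64> {
    // First 8 bytes: nanoseconds within the day, little-endian.
    let nanos_of_day = i64::from_le_bytes(bytes[0..8].try_into().ok()?);
    // Last 4 bytes: Julian day number, little-endian.
    let julian_day = i32::from_le_bytes(bytes[8..12].try_into().ok()?) as i64;

    let days_since_epoch = julian_day - JULIAN_DAY_OF_EPOCH;
    days_since_epoch
        .checked_mul(SECONDS_PER_DAY * NANOS_PER_SECOND)?
        .checked_add(nanos_of_day)
}

fn main() {
    // A Julian day of 2_440_588 with zero nanos decodes to the epoch itself.
    let mut bytes = [0u8; 12];
    bytes[8..12].copy_from_slice(&2_440_588i32.to_le_bytes());
    assert_eq!(int96_to_nanos(bytes), Some(0));
}
```

If that matches what the reader does, I assume the overflow you mention happens for timestamps outside the roughly 1677-09-21 to 2262-04-11 window that `i64` nanoseconds can represent, but please correct me if the issue is elsewhere.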
