Re: [PR] [SPARK-47493][SQL] Disable spark.sql.parquet.inferTimestampNTZ.enabled by default [spark]

via GitHub Thu, 21 Mar 2024 00:36:40 -0700


HyukjinKwon commented on code in PR #45621:
URL: https://github.com/apache/spark/pull/45621#discussion_r1533361364



##########
docs/sql-migration-guide.md:
##########
@@ -42,6 +42,7 @@ license: |
 - Since Spark 4.0, the function `to_csv` no longer supports input with the 
data type `STRUCT`, `ARRAY`, `MAP`, `VARIANT` and `BINARY` (because the `CSV 
specification` does not have standards for these data types and cannot be read 
back using `from_csv`), Spark will throw 
`DATATYPE_MISMATCH.UNSUPPORTED_INPUT_TYPE` exception.
 - Since Spark 4.0, JDBC read option `preferTimestampNTZ=true` will not convert 
Postgres TIMESTAMP WITH TIME ZONE and TIME WITH TIME ZONE data types to 
TimestampNTZType, which is available in Spark 3.5. 
 - Since Spark 4.0, JDBC read option `preferTimestampNTZ=true` will not convert 
MySQL TIMESTAMP to TimestampNTZType, which is available in Spark 3.5. MySQL 
DATETIME is not affected.
+- Since Spark 4.0, the SQL config 
`spark.sql.parquet.inferTimestampNTZ.enabled` is turned off by default. 
Consequently, when reading Parquet files that were not produced by Spark, the 
Parquet reader will no longer automatically recognize data as the TIMESTAMP_NTZ 
data type. This change ensures backward compatibility with releases of Spark 
version 3.2 and earlier. It also aligns the behavior of schema inference for 
Parquet files with that of other data sources such as CSV, JSON, ORC, and JDBC, 
enhancing consistency across the data sources. To revert to the previous 
behavior where TIMESTAMP_NTZ types were inferred, set 
`spark.sql.parquet.inferTimestampNTZ.enabled` to true.

Review Comment:
   Just my two cents. Before reading this PR just now, I actually didn't know 
we were inferring `TIMESTAMP_NTZ` by default (although only when reading 
Parquet files not written by Spark). Unless we want to do this across the board 
(e.g., switching Timestamp to TimestampNTZ everywhere), I think we should not 
have enabled this by default.
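
For readers who want the pre-4.0 behavior described in the migration note, the following is a minimal sketch of toggling the flag per session. It is an illustration, not code from this PR: the helper name `set_parquet_ntz_inference` is hypothetical, and `spark` is assumed to be an existing PySpark `SparkSession` (only its `conf.set` method is used).

```python
# Config key taken from the migration note quoted above.
NTZ_INFERENCE_KEY = "spark.sql.parquet.inferTimestampNTZ.enabled"


def set_parquet_ntz_inference(spark, enabled):
    """Turn TIMESTAMP_NTZ inference for Parquet reads on or off.

    `spark` is assumed to be a SparkSession; the helper itself is
    hypothetical and exists only for this example.
    """
    # Runtime SQL configs are set as strings, so normalize the boolean.
    spark.conf.set(NTZ_INFERENCE_KEY, "true" if enabled else "false")


# Example usage (assumes an active session and a hypothetical file path):
#   set_parquet_ntz_inference(spark, True)
#   df = spark.read.parquet("/path/to/external.parquet")
#   df.printSchema()
```

The same key can also be passed at launch with `--conf`, which avoids touching application code.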



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


