bersprockets commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r919399248


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##########
@@ -148,7 +148,28 @@ class CSVOptions(
   // A language tag in IETF BCP 47 format
  val locale: Locale = parameters.get("locale").map(Locale.forLanguageTag).getOrElse(Locale.US)
 
-  val dateFormatInRead: Option[String] = parameters.get("dateFormat")
+  /**
+   * Infer columns with all valid date entries as date type (otherwise inferred as timestamp type).
+   * Disabled by default for backwards compatibility and performance. When enabled, date entries in
+   * timestamp columns will be cast to timestamp upon parsing. Not compatible with
+   * legacyTimeParserPolicy == LEGACY since legacy date parser will accept extra trailing characters
+   */
+  val inferDate = {
+    val inferDateFlag = getBool("inferDate")
+    if (SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY && inferDateFlag) {

Review Comment:
   > to always use the non-legacy parser for inference and allowing inferDate=true with legacyTimeParserPolicy = LEGACY. What do you think?
   
   It's a little weird that you can specify `inferDate=true` with `legacyTimeParserPolicy=LEGACY`, yet Spark won't properly infer the type of legacy dates, e.g.:
   ```
   scala> sql("set spark.sql.legacy.timeParserPolicy=legacy")
   res0: org.apache.spark.sql.DataFrame = [key: string, value: string]
   
   scala> // 1500-02-29 is a legacy date
   
   scala> val csvInput = Seq("1425-03-22", "2022-01-01", "1500-02-29").toDS
   csvInput: org.apache.spark.sql.Dataset[String] = [value: string]
   
   scala> // infers as string
   
   scala> spark.read.options(Map("inferSchema" -> "true", "inferDate" -> 
"true", "dateFormat" -> "yyyy-MM-dd")).csv(csvInput).printSchema
   root
    |-- _c0: string (nullable = true)
   
   
   scala> // if you specify a schema, it parses just fine, although 1500-02-29 becomes 1500-03-01 (expected)
   
   scala> spark.read.schema("dt date").options(Map("dateFormat" -> 
"yyyy-MM-dd")).csv(csvInput).show(false)
   +----------+
   |dt        |
   +----------+
   |1425-03-22|
   |2022-01-01|
   |1500-03-01|
   +----------+
   
   scala> // remove the legacy date from the input...
   
   scala> val csvInput = Seq("1425-03-22", "2022-01-01").toDS
   csvInput: org.apache.spark.sql.Dataset[String] = [value: string]
   
   scala> // .. and *then* Spark will infer date type
   
   scala> spark.read.options(Map("inferSchema" -> "true", "inferDate" -> 
"true", "dateFormat" -> "yyyy-MM-dd")).csv(csvInput).printSchema
   root
    |-- _c0: date (nullable = true)
   ```
   If the user doesn't have any legacy dates in their input, it will work just fine, but then I am not sure why the user would be specifying `legacyTimeParserPolicy=LEGACY`.
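   For context on why `1500-02-29` behaves this way: Spark's non-legacy parser is built on `java.time`, which uses the proleptic Gregorian calendar (1500 is not a leap year), while the legacy parser sits on the hybrid Julian/Gregorian `java.util.GregorianCalendar` (before the 1582 cutover, Julian rules apply and 1500 *is* a leap year). A minimal Java sketch of that difference, independent of Spark (the class and method names are mine, just for illustration):
   ```java
   import java.time.LocalDate;
   import java.time.format.DateTimeParseException;
   import java.util.Calendar;
   import java.util.GregorianCalendar;

   public class LegacyDateDemo {
       // True if java.time (proleptic Gregorian, like Spark's non-legacy parser)
       // accepts the string as a valid ISO date.
       static boolean strictParses(String s) {
           try {
               LocalDate.parse(s);
               return true;
           } catch (DateTimeParseException e) {
               return false;
           }
       }

       // Day-of-month that the hybrid Julian/Gregorian calendar (the calendar
       // behind the legacy parser) resolves 1500-02-29 to. Under Julian leap-year
       // rules, February 1500 has 29 days, so the date is accepted as-is.
       static int legacyDayOfMonth() {
           GregorianCalendar cal = new GregorianCalendar(1500, Calendar.FEBRUARY, 29);
           return cal.get(Calendar.DAY_OF_MONTH);
       }

       public static void main(String[] args) {
           // Strict parser rejects the legacy-only date but accepts the others.
           System.out.println("java.time parses 1500-02-29: " + strictParses("1500-02-29"));
           System.out.println("java.time parses 1425-03-22: " + strictParses("1425-03-22"));
           System.out.println("legacy calendar day-of-month: " + legacyDayOfMonth());
       }
   }
   ```
   So inference with the non-legacy parser will never see `1500-02-29` as a date, which is consistent with the string-typed schema in the transcript above.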



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]