Re: [PR] [SPARK-57572][SQL] Infer TimeType during CSV and JSON schema inference [spark]

via GitHub Sat, 20 Jun 2026 13:16:08 -0700


MaxGekk commented on code in PR #56634:
URL: https://github.com/apache/spark/pull/56634#discussion_r3447449563



##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/json/JsonInferSchemaSuite.scala:
##########
@@ -120,4 +120,19 @@ class JsonInferSchemaSuite extends SparkFunSuite with 
SQLHelper {
       """{"a": "2884-06-24T02:45:51.138"}""",
       StringType)
   }
+
+  test("SPARK-57572: infer TimeType when timeType.enabled is true") {
+    withSQLConf(SQLConf.TIME_TYPE_ENABLED.key -> "true") {
+      checkType(Map.empty[String, String], """{"a": "12:13:14"}""",
+        TimeType(TimeType.MICROS_PRECISION))
+      checkType(Map.empty[String, String], """{"a": "23:59:59.123456"}""",

Review Comment:
   The enabled-case test only checks positive values. Consider adding a 
negative case — e.g. a timestamp/date string like `2884-06-24T02:45:51.138` 
should still infer as `StringType`/timestamp, not `TimeType` — mirroring the 
CSV enabled test's `2023-01-01` / `not-a-time` checks.



##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchemaSuite.scala:
##########
@@ -273,4 +273,38 @@ class CSVInferSchemaSuite extends SparkFunSuite with 
SQLHelper {
     val inferSchema = new CSVInferSchema(options)
     assert(inferSchema.inferField(NullType, "2884-06-24T02:45:51.138") == 
StringType)
   }
+
+  test("SPARK-57572: infer TimeType when timeType.enabled is true") {
+    withSQLConf(SQLConf.TIME_TYPE_ENABLED.key -> "true") {
+      val options = new CSVOptions(Map.empty[String, String],
+        columnPruning = false, defaultTimeZoneId = "UTC")
+      val inferSchema = new CSVInferSchema(options)
+
+      // Basic time inference
+      assert(inferSchema.inferField(NullType, "12:13:14") == 
TimeType(TimeType.MICROS_PRECISION))
+      assert(inferSchema.inferField(NullType, "23:59:59") == 
TimeType(TimeType.MICROS_PRECISION))
+      assert(inferSchema.inferField(NullType, "00:00:00") == 
TimeType(TimeType.MICROS_PRECISION))
+
+      // Time with fractional seconds
+      assert(inferSchema.inferField(NullType, "12:13:14.123") ==
+        TimeType(TimeType.MICROS_PRECISION))
+
+      // Not a time -- should fall through to other types
+      assert(inferSchema.inferField(NullType, "not-a-time") == StringType)
+
+      // Date should NOT be inferred as time
+      assert(inferSchema.inferField(NullType, "2023-01-01") == DateType)
+    }
+  }
+
+  test("SPARK-57572: TimeType not inferred when timeType.enabled is false") {
+    withSQLConf(SQLConf.TIME_TYPE_ENABLED.key -> "false") {
+      val options = new CSVOptions(Map.empty[String, String],
+        columnPruning = false, defaultTimeZoneId = "UTC")
+      val inferSchema = new CSVInferSchema(options)
+
+      // Time string should fall through to StringType when disabled
+      assert(inferSchema.inferField(NullType, "12:13:14") != 
TimeType(TimeType.MICROS_PRECISION))

Review Comment:
   `!=` passes for any non-Time type, so it doesn't pin down the behavior — and 
the comment is inaccurate: with the flag off, `12:13:14` does not fall through 
to `StringType`. The lenient default timestamp parser accepts time-only strings 
(`justTime`, filled with `LocalDate.now`), so it infers as `TimestampType`. 
Assert the real type and fix the comment:
   ```suggestion
         // When disabled, a time-only string is not inferred as TimeType; the 
lenient timestamp
         // parser accepts it (using the current date), so it infers as 
TimestampType.
         assert(inferSchema.inferField(NullType, "12:13:14") == TimestampType)
   ```



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala:
##########
@@ -176,6 +178,9 @@ class JsonInferSchema(private val options: JSONOptions) 
extends Serializable wit
         }
         if (options.prefersDecimal && decimalTry.isDefined) {
           decimalTry.get
+        } else if (isTimeTypeEnabled &&

Review Comment:
   JSON time inference runs whenever `timeType.enabled` is on, but JSON 
*timestamp* inference is gated by `inferTimestamp` (default `false`). With the 
default, this means a string column infers as `TimeType` while never inferring 
`TimestampType`, and every string field pays a `timeFormatter.parse` attempt — 
the per-string temporal-parse cost that `inferTimestamp=false` exists to avoid 
(SPARK-32130). CSV avoids both by tying time to its existing `preferDate` 
opt-in. Consider moving this check inside the `if (options.inferTimestamp)` 
block so time requires the same temporal opt-in as timestamp, or note in the PR 
why JSON time should be independent.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-57572][SQL] Infer TimeType during CSV and JSON schema inference [spark]

Reply via email to