bersprockets commented on code in PR #36871:
URL: https://github.com/apache/spark/pull/36871#discussion_r899496784


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/UnivocityParserSuite.scala:
##########
@@ -358,4 +358,19 @@ class UnivocityParserSuite extends SparkFunSuite with 
SQLHelper {
       Map("timestampFormat" -> "invalid", "dateFormat" -> "invalid"), false, 
"UTC")
     check(new UnivocityParser(StructType(Seq.empty), optionsWithPattern))
   }
+
+  test("dates should be parsed correctly in a timestamp column") {
+    def checkDate(dataType: DataType): Unit = {
+      val timestampsOptions =
+        new CSVOptions(Map("timestampFormat" -> "dd/MM/yyyy HH:mm",
+          "timestampNTZFormat" -> "dd-MM-yyyy HH:mm", "dateFormat" -> 
"dd_MM_yyyy"),

Review Comment:
   >to make sure timestamps are not parsed as date types without conflicting.
   
   That's actually what happens:
   
   Before this PR:
   ```
   scala> val csvInput = Seq("0,2012-01-01 12:00:00", "1,2021-07-01 
15:00:00").toDS()
   csvInput: org.apache.spark.sql.Dataset[String] = [value: string]
   
   scala> val df = spark.read.option("inferSchema", "true").csv(csvInput)
   df: org.apache.spark.sql.DataFrame = [_c0: int, _c1: timestamp]
   
   scala> df.printSchema
   root
    |-- _c0: integer (nullable = true)
    |-- _c1: timestamp (nullable = true)
   
   scala> 
   ```
   After this PR:
   ```
   scala> val csvInput = Seq("0,2012-01-01 12:00:00", "1,2021-07-01 
15:00:00").toDS()
   csvInput: org.apache.spark.sql.Dataset[String] = [value: string]
   
   scala> val df = spark.read.option("inferSchema", "true").csv(csvInput)
   df: org.apache.spark.sql.DataFrame = [_c0: int, _c1: date]
   
   scala> df.printSchema
   root
    |-- _c0: integer (nullable = true)
    |-- _c1: date (nullable = true)
   
   scala>
   ```
   It looks like some tests fail too, like `CSVInferSchemaSuite`,  and 
`CSVv1Suite`, possibly others (I ran these two suites on my laptop. For some 
reason, the github actions didn't run tests for this PR. Maybe @Jonathancui123 
needs to turn them on in his fork?).
   
   >We should probably 1. add either SQL configuration or an option e.g., 
infersDate
   
   I think you would need something like that: when set, the date formatter 
could use the slower, more strict method of parsing (so "2012-01-01 12:00:00" 
wouldn't parse as a date).
   
   Edit: To do a strict parsing, one might need to use ParsePosition and check 
that the whole date/time value string was consumed.  Even after setting 
lenient=false `SimpleDateFormat.parse` didn't complain about extra characters 
that weren't consumed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to