[GitHub] [orc] sinkinben opened a new issue, #1237: The result is strange when casting `string` to `date` in ORC reading via spark.

GitBox Sun, 28 Aug 2022 01:55:26 -0700


sinkinben opened a new issue, #1237:
URL: https://github.com/apache/orc/issues/1237


   I created an ORC file with the following code.
   ```scala
   val data = Seq(
       ("", "2022-01-32"),  // invalid date; read back as null
       ("", "9808-02-30"),  // invalid date; read back as 9808-02-29
       ("", "2022-06-31")   // invalid date; read back as 2022-06-30
   )
   val cols = Seq("str", "date_str")
   val df = spark.createDataFrame(data).toDF(cols: _*).repartition(1)
   df.printSchema()
   df.show(100)
   df.write.mode("overwrite").orc("/tmp/orc/data.orc")
   ```
   Please note that all three dates are invalid.
   I then read the file back with:
   ```shell
   scala> var df = spark.read.schema("date_str date").orc("/tmp/orc/data.orc"); df.show()
   +----------+
   |  date_str|
   +----------+
   |      null|
   |9808-02-29|
   |2022-06-30|
   +----------+
   ```
   
   Why is `2022-01-32` converted to `null`, while `9808-02-30` is converted to `9808-02-29`?
   
   Intuitively, all three are invalid dates, so I would expect three nulls.
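
For what it's worth, the pattern above (day 32 rejected, day 30 or 31 clamped to the last day of the month) looks like what `java.time` does under `ResolverStyle.SMART`. This is only a guess at the cause: I have not checked which parser ORC or Spark actually uses on this path. The sketch below is plain Scala with no Spark or ORC involved, and `DateResolutionSketch`, `parse`, `smart`, and `strict` are names made up for the illustration. It compares SMART with STRICT resolution on the same three strings.

```scala
import java.time.LocalDate
import java.time.format.{DateTimeFormatter, ResolverStyle}
import scala.util.Try

object DateResolutionSketch {
  // "uuuu" is the proleptic year. With "yyyy" (year-of-era), STRICT mode
  // would also require an era field, so "uuuu" is used for both formatters.
  private val pattern = "uuuu-MM-dd"

  // SMART is the default style: day-of-month must be in 1..31, and a day
  // past the end of the month is clamped to the last valid day.
  val smart: DateTimeFormatter =
    DateTimeFormatter.ofPattern(pattern).withResolverStyle(ResolverStyle.SMART)

  // STRICT rejects any day that does not exist in the given month.
  val strict: DateTimeFormatter =
    DateTimeFormatter.ofPattern(pattern).withResolverStyle(ResolverStyle.STRICT)

  // Returns None where the real reader would presumably produce null.
  def parse(s: String, f: DateTimeFormatter): Option[LocalDate] =
    Try(LocalDate.parse(s, f)).toOption

  def main(args: Array[String]): Unit = {
    val inputs = Seq("2022-01-32", "9808-02-30", "2022-06-31")
    inputs.foreach { s =>
      println(s"$s  smart=${parse(s, smart)}  strict=${parse(s, strict)}")
    }
  }
}
```

With SMART, `2022-01-32` fails to parse because 32 is outside 1..31, while `9808-02-30` and `2022-06-31` resolve to `9808-02-29` and `2022-06-30`. With STRICT, all three fail, which is the "three nulls" behavior I expected.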
   
   


