Re: [PR] [GLUTEN-5102][VL] Support cast date as timestamp in velox [incubator-gluten]

via GitHub Wed, 03 Apr 2024 22:20:50 -0700


PHILO-HE commented on code in PR #5240:
URL: https://github.com/apache/incubator-gluten/pull/5240#discussion_r1550934905



##########
backends-velox/src/test/scala/org/apache/gluten/execution/TestOperator.scala:
##########
@@ -1236,4 +1236,34 @@ class TestOperator extends VeloxWholeStageTransformerSuite {
       }
     }
   }
+
+  test("Cast date to string") {
+    withTempPath {
+      path =>
+        Seq("2023-01-01", "2023-01-02", "2023-01-03")
+          .toDF("dateColumn")
+          .select(to_date($"dateColumn", "yyyy-MM-dd").as("dateColumn"))
+          .write
+          .parquet(path.getCanonicalPath)
+        spark.read.parquet(path.getCanonicalPath).createOrReplaceTempView("view")
+        runQueryAndCompare("SELECT cast(dateColumn as string) from view") {
+          checkGlutenOperatorMatch[ProjectExecTransformer]
+        }
+    }
+  }
+
+  test("Cast date to timestamp") {

Review Comment:
   > Spark's conversion of the date type to timestamp is only supported up to
   > the day of the day, and will not cause problems in different time zones.
   spark.sql("select cast(date'2023-01-02 01:01:01' as timestamp) as ts").show
   +-------------------+
   |                 ts|
   +-------------------+
   |2023-01-02 00:00:00|
   +-------------------+
   
   @dcoliversun, let me help clarify a bit. The time zone does matter when 
casting a date to a timestamp: the date value is adjusted according to the 
configured local time zone during the cast. In Spark, a timestamp value is 
always stored relative to UTC.
   The SQL above returns a result that looks unadjusted because the printed 
output is produced by implicitly casting the timestamp to a string, and that 
cast also takes the local time zone into account. See [spark 
code](https://github.com/apache/spark/blob/v3.3.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L358).
 This also explains the phenomenon you mentioned: "In spark, timestamp behaves 
differently in df.show and df.collect".
   
   We can let Spark write the timestamp result to parquet and then print the 
parquet content to see how it differs when a different time zone is configured. 
Or simply compare the returned timestamp DataFrames, as your added test does.
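
   To make the two-step behavior concrete, here is a minimal sketch that uses 
only `java.time`, not Spark itself. It assumes the date is taken as midnight in 
the session time zone and stored as microseconds since the epoch in UTC, and 
that the string cast renders those microseconds back in the same zone. The 
object and method names (`DateToTimestampSketch`, `dateToMicros`, 
`microsToString`) are hypothetical and are not Spark or Gluten APIs.

```scala
import java.time.{Instant, LocalDate, ZoneId, ZoneOffset}
import java.time.format.DateTimeFormatter

// Hypothetical helper object for illustration only; not part of Spark or Gluten.
object DateToTimestampSketch {
  private val MicrosPerSecond = 1000000L
  private val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  // Date -> timestamp: midnight of `date` in `zone`, as microseconds since the UTC epoch.
  def dateToMicros(date: LocalDate, zone: ZoneId): Long = {
    val instant = date.atStartOfDay(zone).toInstant
    instant.getEpochSecond * MicrosPerSecond + instant.getNano / 1000L
  }

  // Timestamp -> string: render the stored microseconds in `zone`.
  def microsToString(micros: Long, zone: ZoneId): String = {
    val seconds = Math.floorDiv(micros, MicrosPerSecond)
    val nanos = Math.floorMod(micros, MicrosPerSecond) * 1000L
    formatter.withZone(zone).format(Instant.ofEpochSecond(seconds, nanos))
  }

  def main(args: Array[String]): Unit = {
    val date = LocalDate.of(2023, 1, 2)
    val utc: ZoneId = ZoneOffset.UTC
    val shanghai: ZoneId = ZoneId.of("Asia/Shanghai")

    val utcMicros = dateToMicros(date, utc)
    val shanghaiMicros = dateToMicros(date, shanghai)

    // The stored values differ by the zone offset (8 hours for Asia/Shanghai).
    println(utcMicros - shanghaiMicros)

    // Rendering in the same zone hides the adjustment: 2023-01-02 00:00:00
    println(microsToString(shanghaiMicros, shanghai))

    // Rendering in UTC exposes it: 2023-01-01 16:00:00
    println(microsToString(shanghaiMicros, utc))
  }
}
```

   Under these assumptions, the printed string matches the original date in 
any session time zone, while the stored value (what a collected DataFrame or 
a parquet file would hold) changes with the zone.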



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

