[ https://issues.apache.org/jira/browse/SPARK-47063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17821273#comment-17821273 ]

Pablo Langa Blanco commented on SPARK-47063:
--------------------------------------------

Ok, personally I would prefer the truncation too. I'll make a PR with it and we
can discuss it there.
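
For anyone following along, here is a minimal Java sketch of the discrepancy (Java, since that is what Spark's codegen emits). `secondsToMicrosSaturating` is a hypothetical helper, not Spark code; it just shows one way the codegen path could be made to agree with `TimeUnit.SECONDS.toMicros`, which saturates at `Long.MAX_VALUE`/`Long.MIN_VALUE` instead of wrapping:

```java
import java.util.concurrent.TimeUnit;

public class CastOverflowDemo {
    // Hypothetical helper: multiply seconds by 1_000_000 but saturate on
    // overflow, matching TimeUnit.SECONDS.toMicros rather than wrapping.
    static long secondsToMicrosSaturating(long seconds) {
        try {
            return Math.multiplyExact(seconds, 1_000_000L);
        } catch (ArithmeticException e) {
            return seconds > 0 ? Long.MAX_VALUE : Long.MIN_VALUE;
        }
    }

    public static void main(String[] args) {
        long v = Long.MAX_VALUE;
        // Interpreted path: TimeUnit saturates at Long.MAX_VALUE.
        System.out.println(TimeUnit.SECONDS.toMicros(v));  // 9223372036854775807
        // Codegen path today: a plain multiplication wraps around.
        System.out.println(v * 1_000_000L);                // -1000000
        // The saturating helper agrees with the interpreted path.
        System.out.println(secondsToMicrosSaturating(v));  // 9223372036854775807
    }
}
```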

> CAST long to timestamp has different behavior for codegen vs interpreted
> ------------------------------------------------------------------------
>
>                 Key: SPARK-47063
>                 URL: https://issues.apache.org/jira/browse/SPARK-47063
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.2
>            Reporter: Robert Joseph Evans
>            Priority: Major
>
> This likely affects many more versions than this, but I verified it on 3.4.2.
> It also appears to be related to
> https://issues.apache.org/jira/browse/SPARK-39209
> {code:java}
> scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").selectExpr("*", "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false)
> +--------------------+-----------------------------+--------------------+
> |v                   |ts                           |unix_micros(ts)     |
> +--------------------+-----------------------------+--------------------+
> |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 |
> |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808|
> |0                   |1970-01-01 00:00:00          |0                   |
> |1990                |1970-01-01 00:33:10          |1990000000          |
> +--------------------+-----------------------------+--------------------+
> scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").repartition(1).selectExpr("*", "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false)
> +--------------------+-------------------+---------------+
> |v                   |ts                 |unix_micros(ts)|
> +--------------------+-------------------+---------------+
> |9223372036854775807 |1969-12-31 23:59:59|-1000000       |
> |-9223372036854775808|1970-01-01 00:00:00|0              |
> |0                   |1970-01-01 00:00:00|0              |
> |1990                |1970-01-01 00:33:10|1990000000     |
> +--------------------+-------------------+---------------+
> {code}
> It looks like InMemoryTableScanExec is not doing code generation for the 
> expressions, but the ProjectExec after the repartition is.
> If I disable code gen I get the same answer in both cases.
> {code:java}
> scala> spark.conf.set("spark.sql.codegen.wholeStage", false)
> scala> spark.conf.set("spark.sql.codegen.factoryMode", "NO_CODEGEN")
> scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").selectExpr("*", "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false)
> +--------------------+-----------------------------+--------------------+
> |v                   |ts                           |unix_micros(ts)     |
> +--------------------+-----------------------------+--------------------+
> |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 |
> |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808|
> |0                   |1970-01-01 00:00:00          |0                   |
> |1990                |1970-01-01 00:33:10          |1990000000          |
> +--------------------+-----------------------------+--------------------+
> scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").repartition(1).selectExpr("*", "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false)
> +--------------------+-----------------------------+--------------------+
> |v                   |ts                           |unix_micros(ts)     |
> +--------------------+-----------------------------+--------------------+
> |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 |
> |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808|
> |0                   |1970-01-01 00:00:00          |0                   |
> |1990                |1970-01-01 00:33:10          |1990000000          |
> +--------------------+-----------------------------+--------------------+
> {code}
> [https://github.com/apache/spark/blob/e2cd71a4cd54bbdf5af76d3edfbb2fc8c1b067b6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L1627]
> is the code used in codegen, but
> [https://github.com/apache/spark/blob/e2cd71a4cd54bbdf5af76d3edfbb2fc8c1b067b6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L687]
> is what is used outside of codegen.
> Apparently `SECONDS.toMicros` truncates (saturates) the value on an overflow,
> but the codegen multiplication wraps around.
> {code:java}
> scala> Long.MaxValue
> res11: Long = 9223372036854775807
> scala> java.util.concurrent.TimeUnit.SECONDS.toMicros(Long.MaxValue)
> res12: Long = 9223372036854775807
> scala> Long.MaxValue * (1000L * 1000L)
> res13: Long = -1000000
> {code}
> Ideally these would be consistent with each other. I personally would prefer 
> the truncation as it feels more accurate, but I am fine either way.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
