clee704 opened a new issue, #5524:
URL: https://github.com/apache/incubator-gluten/issues/5524
### Backend
VL (Velox)
### Bug description
Velox evaluates `date_format(timestamp'12345-01-01 01:01:01', 'yyyy-MM')` to
`'12345-01'`, whereas vanilla Spark evaluates the same expression to
`'+12345-01'`. This can be an issue because `unix_timestamp` in vanilla Spark
only accepts the signed form `'+12345-01'`. If `date_format` is executed in
Velox and the result is passed to `unix_timestamp` in vanilla Spark, parsing
fails.
```scala
// Somehow CREATE TABLE doesn't work with five-digit year timestamps
spark.sql("select timestamp'12345-01-01 01:01:01' c").write.mode("overwrite").save("x")
spark.read.load("x").createOrReplaceTempView("t")

// date_format is run in Velox
spark.sql("select date_format(c, 'yyyy-MM') from t").explain()
// == Physical Plan ==
// VeloxColumnarToRowExec
// +- ^(14) ProjectExecTransformer [date_format(c#83, yyyy-MM, Some(Etc/UTC)) AS date_format(c, yyyy-MM)#85]
//    +- ^(14) NativeFileScan parquet [c#83] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/ssd/chungmin/repos/spark34/x], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<c:timestamp>

// Use collect() instead of show(), as show() makes the function run in vanilla
// Spark in Spark 3.5 due to the inserted ToPrettyString function.
spark.sql("select date_format(c, 'yyyy-MM') from t").collect()
// Array([12345-01])

spark.sql("create table t2 as select date_format(c, 'yyyy-MM') c from t")
spark.sql("set spark.gluten.enabled = false")
spark.sql("select unix_timestamp(c, 'yyyy-MM') from t2").collect()
// 24/04/25 02:01:01 ERROR TaskResources: Task 8 failed by error:
// org.apache.spark.SparkUpgradeException: [INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER] You may get a different result due to the upgrading to Spark >= 3.0:
// Fail to parse '12345-01' in the new parser. You can set "spark.sql.legacy.timeParserPolicy" to "LEGACY" to restore the behavior before Spark 3.0, or set to "CORRECTED" and treat it as an invalid datetime string.
// ...
```
Spark formats timestamps with `java.time.format.DateTimeFormatter`, which
emits the plus sign only when the pattern is `"yyyy"` and the year has more
than four digits.
```scala
import java.time.{LocalDate, ZoneId}
import java.time.format.DateTimeFormatter

DateTimeFormatter.ofPattern("yyyy").withZone(ZoneId.of("Z")).format(LocalDate.of(12345, 1, 1))
// "+12345"
```
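The sign comes from how the JDK expands the pattern: per the
`DateTimeFormatter` documentation, four or more `y` letters translate to
`appendValue(YEAR_OF_ERA, width, 19, SignStyle.EXCEEDS_PAD)`, which prints a
sign whenever the value needs more digits than the pad width. A minimal sketch
of that expansion (plain JDK, no Spark involved):

```scala
import java.time.LocalDate
import java.time.format.{DateTimeFormatterBuilder, SignStyle}
import java.time.temporal.ChronoField

// Equivalent of the "yyyy" pattern: pad to 4 digits, print a sign once the
// value exceeds the pad width (SignStyle.EXCEEDS_PAD).
val yyyy = new DateTimeFormatterBuilder()
  .appendValue(ChronoField.YEAR_OF_ERA, 4, 19, SignStyle.EXCEEDS_PAD)
  .toFormatter

yyyy.format(LocalDate.of(2024, 1, 1))  // "2024"
yyyy.format(LocalDate.of(12345, 1, 1)) // "+12345"
```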
Five-digit years should be extremely rare in real-world applications, but this
is breaking Delta unit tests.
The issue occurs with Spark 3.4.2 and 3.5.1; I didn't check older versions.
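The asymmetry on the parsing side can also be reproduced with the plain JDK
formatter: in strict parse mode (the `DateTimeFormatter.ofPattern` default,
which I assume roughly mirrors Spark's post-3.0 parser), `SignStyle.EXCEEDS_PAD`
only accepts a year longer than the pad width when an explicit sign is present,
which is why the unsigned Velox output is rejected. A sketch:

```scala
import java.time.format.DateTimeFormatter
import scala.util.Try

val p = DateTimeFormatter.ofPattern("yyyy-MM")

// The signed form produced by vanilla Spark parses fine...
Try(p.parse("+12345-01")).isSuccess // true
// ...but the unsigned form produced by Velox does not: in strict mode,
// a fifth year digit is only accepted after an explicit sign.
Try(p.parse("12345-01")).isSuccess  // false
```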
### Spark version
None
### Spark configurations
spark.plugins=org.apache.gluten.GlutenPlugin
spark.gluten.enabled=true
spark.gluten.sql.columnar.backend.lib=velox
spark.memory.offHeap.enabled=true
spark.memory.offHeap.size=28g
### System information
Velox System Info v0.0.2
Commit: 45dc46a9dd8a4197876da4c661d856f73d31673f
CMake Version: 3.28.3
System: Linux-6.5.0-1018-azure
Arch: x86_64
C++ Compiler: /usr/bin/c++
C++ Compiler Version: 11.4.0
C Compiler: /usr/bin/cc
C Compiler Version: 11.4.0
CMake Prefix Path:
/usr/local;/usr;/;/ssd/linuxbrew/.linuxbrew/Cellar/cmake/3.28.3;/usr/local;/usr/X11R6;/usr/pkg;/opt
### Relevant logs
_No response_
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]