clee704 opened a new issue, #5524:
URL: https://github.com/apache/incubator-gluten/issues/5524

   ### Backend
   
   VL (Velox)
   
   ### Bug description
   
   Velox evaluates `date_format(timestamp'12345-01-01 01:01:01', 'yyyy-MM')` to `'12345-01'`, whereas vanilla Spark evaluates the same expression to `'+12345-01'`. This can be an issue because `unix_timestamp` in vanilla Spark only accepts `'+12345-01'`: if `date_format` is executed in Velox and the result is then passed to `unix_timestamp` in vanilla Spark, parsing fails.
   
   ```scala
   // Somehow CREATE TABLE doesn't work with five-digit year timestamps
   spark.sql("select timestamp'12345-01-01 01:01:01' c").write.mode("overwrite").save("x")
   spark.read.load("x").createOrReplaceTempView("t")
   
   // date_format is run in Velox
   spark.sql("select date_format(c, 'yyyy-MM') from t").explain()
   // == Physical Plan ==
   // VeloxColumnarToRowExec
   // +- ^(14) ProjectExecTransformer [date_format(c#83, yyyy-MM, Some(Etc/UTC)) AS date_format(c, yyyy-MM)#85]
   //    +- ^(14) NativeFileScan parquet [c#83] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/ssd/chungmin/repos/spark34/x], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<c:timestamp>
   
   // Use collect() instead of show(), as show() makes the function run in
   // vanilla Spark in Spark 3.5 due to the inserted ToPrettyString function.
   spark.sql("select date_format(c, 'yyyy-MM') from t").collect()
   // Array([12345-01])
   
   spark.sql("create table t2 as select date_format(c, 'yyyy-MM') c from t")
   spark.sql("set spark.gluten.enabled = false")
   spark.sql("select unix_timestamp(c, 'yyyy-MM') from t2").collect()
   // 24/04/25 02:01:01 ERROR TaskResources: Task 8 failed by error:
   // org.apache.spark.SparkUpgradeException: [INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER] You may get a different result due to the upgrading to Spark >= 3.0:
   // Fail to parse '12345-01' in the new parser. You can set "spark.sql.legacy.timeParserPolicy" to "LEGACY" to restore the behavior before Spark 3.0, or set to "CORRECTED" and treat it as an invalid datetime string.
   // ...
   ```
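
   The parse failure can be reproduced outside Spark with `java.time` alone, since Spark's non-legacy parser is built on `DateTimeFormatter`. A minimal sketch in plain Java (the class name is illustrative): with strict parsing, the `yyyy` pattern accepts a signed five-digit year but rejects the unsigned form.

   ```java
   import java.time.YearMonth;
   import java.time.format.DateTimeFormatter;
   import java.time.format.DateTimeParseException;

   public class FiveDigitYearParse {
       public static void main(String[] args) {
           DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy-MM");

           // A signed five-digit year round-trips through the formatter.
           YearMonth ym = YearMonth.parse("+12345-01", fmt);
           System.out.println(fmt.format(ym)); // +12345-01

           // An unsigned five-digit year is rejected by strict parsing.
           try {
               YearMonth.parse("12345-01", fmt);
           } catch (DateTimeParseException e) {
               System.out.println("rejected: " + e.getMessage());
           }
       }
   }
   ```

   This matches the Spark error above: the string Velox produces is exactly the form the new parser refuses.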
   
   Spark uses `java.time.format.DateTimeFormatter`. The plus sign appears only when the format is `"yyyy"` and the year has more than four digits.
   
   ```scala
   import java.time.{LocalDate, ZoneId}
   import java.time.format.DateTimeFormatter
   
   
   DateTimeFormatter.ofPattern("yyyy").withZone(ZoneId.of("Z")).format(LocalDate.of(12345, 1, 1))
   // "+12345"
   ```
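
   This is documented behavior rather than a quirk: per the `DateTimeFormatterBuilder.appendPattern` javadoc, four or more `y` letters map to `appendValue(field, count, 19, SignStyle.EXCEEDS_PAD)`, which pads to the letter count and emits a sign once the value exceeds the pad width. A sketch in plain Java making that mapping explicit:

   ```java
   import java.time.LocalDate;
   import java.time.format.DateTimeFormatter;
   import java.time.format.DateTimeFormatterBuilder;
   import java.time.format.SignStyle;
   import java.time.temporal.ChronoField;

   public class ExceedsPadDemo {
       public static void main(String[] args) {
           // "yyyy" is equivalent to this builder call: pad the year to
           // 4 digits, and print a sign once it no longer fits the pad.
           DateTimeFormatter f = new DateTimeFormatterBuilder()
               .appendValue(ChronoField.YEAR_OF_ERA, 4, 19, SignStyle.EXCEEDS_PAD)
               .toFormatter();

           System.out.println(f.format(LocalDate.of(2024, 1, 1)));  // 2024
           System.out.println(f.format(LocalDate.of(12345, 1, 1))); // +12345
       }
   }
   ```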
   
   Five-digit years should be extremely rare in real-world applications, but the discrepancy is breaking Delta unit tests.
   
   The issue occurs with Spark 3.4.2 and 3.5.1; older versions were not checked.
   
   ### Spark version
   
   None
   
   ### Spark configurations
   
   spark.plugins=org.apache.gluten.GlutenPlugin
   spark.gluten.enabled=true
   spark.gluten.sql.columnar.backend.lib=velox
   spark.memory.offHeap.enabled=true
   spark.memory.offHeap.size=28g
   
   ### System information
   
   Velox System Info v0.0.2
   Commit: 45dc46a9dd8a4197876da4c661d856f73d31673f
   CMake Version: 3.28.3
   System: Linux-6.5.0-1018-azure
   Arch: x86_64
   C++ Compiler: /usr/bin/c++
   C++ Compiler Version: 11.4.0
   C Compiler: /usr/bin/cc
   C Compiler Version: 11.4.0
   CMake Prefix Path: /usr/local;/usr;/;/ssd/linuxbrew/.linuxbrew/Cellar/cmake/3.28.3;/usr/local;/usr/X11R6;/usr/pkg;/opt
   
   
   ### Relevant logs
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

