parthchandra commented on PR #4256:
URL: 
https://github.com/apache/datafusion-comet/pull/4256#issuecomment-4410306347

   > While going through this with `/review-comet-pr` I dropped a small set of 
edge-case SQL test files under 
`spark/src/test/resources/sql-tests/expressions/datetime/edge_cases/` and ran 
them against the PR locally with `-Pspark-4.1`. Posting the results in case 
they're useful.
   > 
   > **8 of 10 edge-case queries diverge from Spark.** The fractional-digit 
truncation cases work fine. All eight failures are leading-whitespace variants:
   > 
   > Query      Spark   This PR
   > `to_time(' 12:30:45')`     `12:30:45`      `CometNativeException: cannot 
be parsed to a TIME value`
   > `to_time(' 12:30:45')`     `12:30:45`      same exception
   > `to_time(' 12:30:45 ')`    `12:30:45`      same exception
   > `to_time(' 1:00:00 PM')`   `13:00` same exception
   > `to_time('\t12:30:45')`    `12:30:45`      same exception
   > `to_time('\n12:30:45')`    `12:30:45`      same exception
   > `try_to_time(' 12:30:45')` `12:30:45`      `null`
   > `try_to_time(' 1:00:00 PM')`       `13:00` `null`
   > Passing edge cases (both engines agree):
   > 
   > * `to_time('00:00:00.1234567')` and `to_time('12:30:45.999999999')`. 
Fractional truncation past microseconds works correctly.
   > 
   > **Root cause.** `string_to_time` in 
`native/spark-expr/src/datetime_funcs/to_time.rs` only does `s.trim_end()`. 
Spark's path is `stringToTime` then `parseTimestampString` 
(`SparkDateTimeUtils.scala:493`), and `parseTimestampString` trims both ends 
with `getTrimmedStart` / `getTrimmedEnd`, which use 
`UTF8String.isWhitespaceOrISOControl` so tabs and newlines are stripped too. A 
`trim` at the top of `string_to_time` (covering ASCII whitespace and ISO 
control bytes) should close the gap.
   > 
   > The unit test `test_leading_space_with_t_prefix` in `to_time.rs` is 
correct in result for `" T12:30"` but its comment `Spark only right-trims` is 
misleading: Spark does an additional left-trim inside `parseTimestampString`, 
and that input is rejected because the `T`-prefix branch is gated on `j == 0` 
rather than because of trimming. Worth rewording so future readers don't follow 
the trail.
   
   Thanks for catching this. Updated and added all the above as test cases


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to