andygrove commented on PR #4256:
URL: 
https://github.com/apache/datafusion-comet/pull/4256#issuecomment-4407029035

   While going through this with `/review-comet-pr` I dropped a small set of 
edge-case SQL test files under 
`spark/src/test/resources/sql-tests/expressions/datetime/edge_cases/` and ran 
them against the PR locally with `-Pspark-4.1`. Posting the results in case 
they're useful.
   
   **8 of 10 edge-case queries diverge from Spark.** The fractional-digit 
truncation cases work fine. All eight failures are leading-whitespace variants:
   
   | Query | Spark | This PR |
   | --- | --- | --- |
   | `to_time(' 12:30:45')` | `12:30:45` | `CometNativeException: cannot be 
parsed to a TIME value` |
   | `to_time('  12:30:45')` | `12:30:45` | same exception |
   | `to_time(' 12:30:45 ')` | `12:30:45` | same exception |
   | `to_time(' 1:00:00 PM')` | `13:00` | same exception |
   | `to_time('\t12:30:45')` | `12:30:45` | same exception |
   | `to_time('\n12:30:45')` | `12:30:45` | same exception |
   | `try_to_time(' 12:30:45')` | `12:30:45` | `null` |
   | `try_to_time(' 1:00:00 PM')` | `13:00` | `null` |
   
   Passing edge cases (both engines agree):
   
   - `to_time('00:00:00.1234567')` and `to_time('12:30:45.999999999')`. 
Fractional truncation past microseconds works correctly.
   
   **Root cause.** `string_to_time` in 
`native/spark-expr/src/datetime_funcs/to_time.rs` only does `s.trim_end()`. 
Spark's path is `stringToTime` then `parseTimestampString` 
(`SparkDateTimeUtils.scala:493`), and `parseTimestampString` trims both ends 
with `getTrimmedStart` / `getTrimmedEnd`, which use 
`UTF8String.isWhitespaceOrISOControl` so tabs and newlines are stripped too. A 
`trim` at the top of `string_to_time` (covering ASCII whitespace and ISO 
control bytes) should close the gap.
   
   The unit test `test_leading_space_with_t_prefix` in `to_time.rs` is correct 
in result for `" T12:30"` but its comment `Spark only right-trims` is 
misleading: Spark does an additional left-trim inside `parseTimestampString`, 
and that input is rejected because the `T`-prefix branch is gated on `j == 0` 
rather than because of trimming. Worth rewording so future readers don't follow 
the trail.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to