andygrove opened a new issue, #21510:
URL: https://github.com/apache/datafusion/issues/21510
### Describe the bug
The `datafusion-spark` implementation of `substring` does not match Apache
Spark behavior when a negative start position reaches back past the beginning
of the string (its magnitude exceeds the string length). `datafusion-spark`
clamps the start to position 1 and returns a full-length result, while Spark
shortens the result by the number of positions the start falls before
position 1.
This was discovered by running a PySpark validation script against the
`.slt` test files (see #17045, #21508).
### To Reproduce
The `.slt` test at
`datafusion/sqllogictest/test_files/spark/string/substring.slt` line 138
contains:
```sql
SELECT substring('Spark SQL', -300, 3);
```
The test expects `Spa`, but Apache Spark returns an empty string.
### Expected behavior
`substring` should match Spark's semantics for negative start positions:
| Expression | Spark result | datafusion-spark result |
|---|---|---|
| `substring('Spark SQL', -9, 3)` | `Spa` | `Spa` ✓ |
| `substring('Spark SQL', -10, 3)` | `Sp` | (likely `Spa`) |
| `substring('Spark SQL', -11, 3)` | `S` | (likely `Spa`) |
| `substring('Spark SQL', -12, 3)` | (empty string) | (likely `Spa`) |
| `substring('Spark SQL', -300, 3)` | (empty string) | `Spa` ✗ |
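Spark's behavior: for a negative `start`, the effective 1-based position is
`len(str) + start + 1`. When this position is less than 1, the requested length
is reduced by the overshoot (`1 - position`), so only the part of the window
that overlaps the string is returned. When the window ends before position 1,
that is, when `position + length <= 1`, the result is empty.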
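
For reference, the rule above can be sketched as a small Python model. This is an illustrative approximation, not Spark's or DataFusion's actual code: the name `spark_substring` is hypothetical, the handling of `pos = 0` and positive positions is an assumption that the table above does not cover, and Python's `len` counts code points.

```python
def spark_substring(s: str, pos: int, length: int) -> str:
    """Illustrative model of the Spark semantics described in this issue."""
    n = len(s)
    # Convert the 1-based SQL position to a 0-based start offset.
    # A negative pos counts back from the end and may land before the string.
    if pos > 0:
        start = pos - 1
    elif pos < 0:
        start = n + pos
    else:
        start = 0  # assumption: 0 is treated like position 1
    end = start + length
    # Clamp the window [start, end) to the string. Any part of the window
    # that lies before offset 0 is dropped, which shortens the result.
    lo = max(start, 0)
    hi = min(end, n)
    return s[lo:hi] if lo < hi else ""


# The rows of the table above, evaluated with the model.
for p in (-9, -10, -11, -12, -300):
    print(p, repr(spark_substring("Spark SQL", p, 3)))
```

The key difference from clamping the start to position 1 is that the end of the window is computed from the unclamped start, so the overshoot is subtracted from the returned length instead of being ignored.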
### Additional context
The same bug affects `substr` (alias for `substring`). The corresponding
`.slt` test at line 189 also has wrong expected values for the same reason.
The `.slt` expected values at lines 138 and 189 will need to be updated
along with the implementation fix.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]