SEPURI-SAI-KRISHNA opened a new pull request, #56746:
URL: https://github.com/apache/spark/pull/56746
### What changes were proposed in this pull request?
This PR fixes `slice(array, start, length)` returning an empty array (dropping
all elements) when `length` is large enough that `start_0based + length`
overflows a 32-bit int in the interpreted evaluation path.
`Slice.nullSafeEval` computed the result as:
    data.slice(startIndex, startIndex + lengthInt)
For a large `length`, `startIndex + lengthInt` overflows to a negative `until`,
and under Scala 2.13 `Seq.slice` with a negative `until` returns an empty
result, so the whole tail is silently dropped. The codegen path already routes
through `ArrayExpressionUtils.sliceLength`, which clamps the length to the
number of elements remaining after `startIndex`, so the interpreted and codegen
paths disagreed.
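The overflow described above can be reproduced outside Spark. The following is
a minimal sketch, not the actual `Slice.nullSafeEval` code: `NaiveSliceSketch`
and `naiveSlice` are hypothetical names, a plain `Vector` stands in for Spark's
array data, and only a positive 1-based `start` is handled.

```scala
object NaiveSliceSketch {
  // Simplified stand-in for the pre-fix interpreted path (assumption: the real
  // code also handles zero/negative `start` and nulls, which are omitted here).
  def naiveSlice[A](data: Vector[A], start: Int, length: Int): Vector[A] = {
    require(start > 0, "this sketch only handles a positive 1-based start")
    val startIndex = start - 1
    // Int addition wraps silently: 1 + Int.MaxValue == Int.MinValue.
    val until = startIndex + length
    data.slice(startIndex, until)
  }

  def main(args: Array[String]): Unit = {
    val data = Vector(1, 2, 3, 4, 5, 6)
    println(1 + Int.MaxValue)                  // -2147483648
    println(naiveSlice(data, 2, 3))            // Vector(2, 3, 4)
    println(naiveSlice(data, 2, Int.MaxValue)) // Vector()
  }
}
```

With `start = 2` the 0-based index is 1, so adding `Int.MaxValue` wraps to a
negative `until` and the slice comes back empty.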
The fix routes the interpreted path through the same
`ArrayExpressionUtils.sliceLength` helper that codegen uses (introduced in
SPARK-57171). `sliceLength` rejects a negative length and clamps to
`numElements - startIndex`, so `startIndex + resLength` can no longer overflow
and both execution paths produce identical results.
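The clamping idea can be sketched as follows. This is an illustration under
stated assumptions rather than the Spark implementation: `sliceLengthSketch` is
a hypothetical stand-in for `ArrayExpressionUtils.sliceLength`, whose real
signature and error type may differ, and only a positive 1-based `start` is
handled.

```scala
object ClampedSliceSketch {
  // Hypothetical stand-in for ArrayExpressionUtils.sliceLength: reject a
  // negative length, then clamp to the elements remaining after startIndex.
  // (Assumption: the real helper raises a Spark-specific error instead.)
  def sliceLengthSketch(length: Int, startIndex: Int, numElements: Int): Int = {
    require(length >= 0, s"length must be non-negative, got $length")
    math.min(length, numElements - startIndex)
  }

  def clampedSlice[A](data: Vector[A], start: Int, length: Int): Vector[A] = {
    require(start > 0, "this sketch only handles a positive 1-based start")
    val startIndex = start - 1
    if (startIndex >= data.length) Vector.empty
    else {
      val resLength = sliceLengthSketch(length, startIndex, data.length)
      // resLength <= data.length - startIndex, so this sum cannot exceed
      // data.length and therefore cannot overflow.
      data.slice(startIndex, startIndex + resLength)
    }
  }

  def main(args: Array[String]): Unit = {
    val data = Vector(1, 2, 3, 4, 5, 6)
    println(clampedSlice(data, 2, Int.MaxValue)) // Vector(2, 3, 4, 5, 6)
    println(clampedSlice(data, 2, 3))            // Vector(2, 3, 4)
  }
}
```

Because the clamped length is bounded by the array size, the upper bound passed
to `slice` is at most `data.length`, regardless of how large `length` is.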
### Why are the changes needed?
`slice` silently returns wrong results (data loss) instead of the correct
sub-array. For constant arguments the wrong value is produced even under the
default configuration, because `ConstantFolding` evaluates the expression
through the interpreted `eval()` at plan time. Example on released Spark 4.0.3
(Scala 2.13.16):
    spark.sql("SELECT slice(array(1,2,3,4,5,6), 2, 2147483647)").show(false)
    -- returns [] (expected [2, 3, 4, 5, 6])
The interpreted path is also used when whole-stage codegen falls back at
runtime (e.g. when the generated method exceeds
`spark.sql.codegen.hugeMethodLimit`), so the wrong result can surface without
any internal configuration being set.
(Spark 3.5 / Scala 2.12 is unaffected: 2.12's `slice` double-overflows
`until - lo` back to a positive count and accidentally returns the correct
elements.)
### Does this PR introduce _any_ user-facing change?
Yes. `slice` with a large `length` now returns the correct tail of the array in
the interpreted path, matching the codegen path, instead of an empty array.
- Before (Spark 4.0+, interpreted/constant-folded):
`slice(array(1,2,3,4,5,6), 2, 2147483647)` => `[]`
- After:
`slice(array(1,2,3,4,5,6), 2, 2147483647)` => `[2, 3, 4, 5, 6]`
### How was this patch tested?
Added regression cases to `CollectionExpressionsSuite` ("Slice" test).
`checkEvaluation` exercises both the interpreted and codegen paths, so the new
cases fail on the old code (interpreted returned `[]`) and pass with the fix:
    checkEvaluation(Slice(a0, Literal(2), Literal(Int.MaxValue)), Seq(2, 3, 4, 5, 6))
    checkEvaluation(Slice(a0, Literal(1), Literal(Int.MaxValue)), Seq(1, 2, 3, 4, 5, 6))
Ran the full suite locally:
    build/sbt 'catalyst/testOnly org.apache.spark.sql.catalyst.expressions.CollectionExpressionsSuite'
    -- Tests: succeeded 61, failed 0
Also reproduced the original bug on a released Spark 4.0.3 (Scala 2.13.16)
`spark-shell` and confirmed the corrected behavior with the fix.
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Claude Opus 4.8)