SEPURI-SAI-KRISHNA opened a new pull request, #56746:
URL: https://github.com/apache/spark/pull/56746

   ### What changes were proposed in this pull request?
   
     This PR fixes `slice(array, start, length)` returning an empty array (dropping all
     elements) when `length` is large enough that `start_0based + length` overflows a
     32-bit int in the interpreted evaluation path.
   
     `Slice.nullSafeEval` computed the result as:
   
         data.slice(startIndex, startIndex + lengthInt)
   
     For a large `length`, `startIndex + lengthInt` overflows to a negative `until`, and
     under Scala 2.13 `Seq.slice` with a negative `until` returns an empty result, so the
     whole tail is silently dropped. The codegen path already routes through
     `ArrayExpressionUtils.sliceLength`, which clamps the length to the number of elements
     remaining after `startIndex`, so the interpreted and codegen paths disagreed.
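
     The overflow described above can be reproduced in isolation. The following is a
     minimal sketch, not the Spark code: `SliceOverflowSketch`, `unclampedUntil`, and the
     sample values are hypothetical names chosen for illustration, and `data` is a plain
     Scala `Seq` standing in for the array contents.

```scala
object SliceOverflowSketch {
  // Mirrors the unclamped expression `startIndex + lengthInt` quoted above.
  // Int addition wraps silently on overflow.
  def unclampedUntil(startIndex: Int, lengthInt: Int): Int = startIndex + lengthInt

  def main(args: Array[String]): Unit = {
    val data = Seq(1, 2, 3, 4, 5, 6) // stand-in for the array contents
    val startIndex = 1               // 0-based index for SQL start = 2
    val until = unclampedUntil(startIndex, Int.MaxValue)

    println(until)     // wraps around to Int.MinValue (-2147483648)
    println(until < 0) // true

    // With a negative `until`, the result depends on the Scala collections
    // version; per the description above, Scala 2.13 yields an empty result.
    println(data.slice(startIndex, until))
  }
}
```

     The last line is left without an expected value on purpose, since the outcome is
     exactly the version-dependent behavior this PR works around.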
   
     The fix routes the interpreted path through the same `ArrayExpressionUtils.sliceLength`
     helper that codegen uses (introduced in SPARK-57171). `sliceLength` rejects a negative
     length and clamps to `numElements - startIndex`, so `startIndex + resLength` can no
     longer overflow and both execution paths produce identical results.
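
     The clamping idea can be sketched as follows. This is an illustration under stated
     assumptions, not the actual helper: `clampedLength` and `sliceClamped` are
     hypothetical names, the real `ArrayExpressionUtils.sliceLength` signature and error
     type are not shown in this description, and `startIndex` is assumed to be an
     already-resolved, non-negative 0-based index.

```scala
object SliceClampSketch {
  // Hypothetical helper: rejects a negative length, then clamps it to the
  // number of elements remaining after startIndex (never below zero).
  // The exception type here is an assumption made for this sketch.
  def clampedLength(numElements: Int, startIndex: Int, length: Int): Int = {
    require(length >= 0, s"length must be non-negative, got $length")
    math.min(length, math.max(numElements - startIndex, 0))
  }

  // Assumes startIndex >= 0. Because resLength <= data.length - startIndex
  // whenever startIndex is in range, startIndex + resLength cannot overflow.
  def sliceClamped[A](data: Seq[A], startIndex: Int, length: Int): Seq[A] = {
    val resLength = clampedLength(data.length, startIndex, length)
    data.slice(startIndex, startIndex + resLength)
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(1, 2, 3, 4, 5, 6)
    println(sliceClamped(data, 1, Int.MaxValue)) // List(2, 3, 4, 5, 6)
  }
}
```

     Clamping the length before the addition keeps the upper bound within the array
     size, so the result no longer depends on how a given Scala version treats a
     negative `until`.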
   
     ### Why are the changes needed?
   
     `slice` silently returns wrong results (data loss) instead of the correct sub-array.
     For constant arguments the wrong value is produced even by default, because
     `ConstantFolding` evaluates the expression through the interpreted `eval()` at plan
     time. Example on released Spark 4.0.3 (Scala 2.13.16):
   
         spark.sql("SELECT slice(array(1,2,3,4,5,6), 2, 2147483647)").show(false)
         -- returns []   (expected [2, 3, 4, 5, 6])

     The interpreted path is also used when whole-stage codegen falls back at runtime
     (e.g. when the generated method exceeds `spark.sql.codegen.hugeMethodLimit`), so the
     wrong result can surface without any internal configuration being set.
   
     (Spark 3.5 / Scala 2.12 is unaffected: 2.12's `slice` double-overflows `until - lo`
     back to a positive count and accidentally returns the correct elements.)
   
     ### Does this PR introduce _any_ user-facing change?
   
     Yes. `slice` with a large `length` now returns the correct tail of the array in the
     interpreted path, matching the codegen path, instead of an empty array.
   
     - Before (Spark 4.0+, interpreted/constant-folded):
       `slice(array(1,2,3,4,5,6), 2, 2147483647)` => `[]`
     - After:
       `slice(array(1,2,3,4,5,6), 2, 2147483647)` => `[2, 3, 4, 5, 6]`
   
     ### How was this patch tested?
   
     Added regression cases to `CollectionExpressionsSuite` ("Slice" test). `checkEvaluation`
     exercises both the interpreted and codegen paths, so the new cases fail on the old code
     (interpreted returned `[]`) and pass with the fix:
   
         checkEvaluation(Slice(a0, Literal(2), Literal(Int.MaxValue)), Seq(2, 3, 4, 5, 6))
         checkEvaluation(Slice(a0, Literal(1), Literal(Int.MaxValue)), Seq(1, 2, 3, 4, 5, 6))
   
     Ran the full suite locally:
   
         build/sbt 'catalyst/testOnly org.apache.spark.sql.catalyst.expressions.CollectionExpressionsSuite'
         -- Tests: succeeded 61, failed 0
   
     Also reproduced the original bug on a released Spark 4.0.3 (Scala 2.13.16) `spark-shell`
     and confirmed the corrected behavior with the fix.
   
     ### Was this patch authored or co-authored using generative AI tooling?
   
     Generated-by: Claude Code (Claude Opus 4.8)
   

