[PR] [CORE] Reuse start indices from an ancestor ZippedWithIndexRDD [spark]

via GitHub Tue, 09 Jun 2026 05:11:17 -0700


zhengruifeng opened a new pull request, #56403:
URL: https://github.com/apache/spark/pull/56403


   ### What changes were proposed in this pull request?
   
   This PR lets `ZippedWithIndexRDD` reuse partition start indices that an 
ancestor has already computed, instead of always launching a counting job.
   
   `getAncestorWithSamePartitionSizes` is extended to:
   - Track recursion depth and return it alongside the ancestor, so the deepest 
qualifying ancestor can be chosen.
   - Handle `ZippedPartitionsRDD2` (when it preserves partition sizes) by 
recursing into both parents and keeping the deeper match.
   - Stop at a `ZippedWithIndexRDD` ancestor whose `startIndices` are already 
populated. The field is `@transient`, so it is `null` on a deserialized 
instance; a `null` guard skips such an ancestor, and the lookup falls through 
to a counting job.
   
   When the resolved ancestor is itself a `ZippedWithIndexRDD`, `startIndices` 
reuses that ancestor's indices directly and skips the counting job entirely.
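   
   A minimal, self-contained sketch of the search described above, for 
illustration only. `Node` and its subclasses, `findAncestor`, and 
`resolveStartIndices` are hypothetical stand-ins, not Spark classes or 
methods, and `Option` models the `null` check on the `@transient` field. The 
actual `getAncestorWithSamePartitionSizes` may differ in its cases and in how 
it breaks ties.

```scala
// Hypothetical lineage model; none of these are Spark classes.
sealed trait Node
// A source whose partition sizes are not known in advance.
final case class Source(name: String) extends Node
// A transformation that keeps every partition's size (for example, a map).
final case class SizePreserving(parent: Node) extends Node
// A transformation that may change partition sizes (for example, a filter).
final case class SizeChanging(parent: Node) extends Node
// A two-parent zip; preservesSizes is an assumed flag.
final case class Zipped2(left: Node, right: Node, preservesSizes: Boolean) extends Node
// A zipWithIndex node; None models a null (not yet available) startIndices field.
final case class Indexed(parent: Node, startIndices: Option[Vector[Long]]) extends Node

// Returns the chosen ancestor together with its depth below the starting node.
def findAncestor(node: Node, depth: Int = 0): (Node, Int) = node match {
  // Stop here: the indices are already populated and can be reused.
  case Indexed(_, Some(_)) => (node, depth)
  case SizePreserving(parent) => findAncestor(parent, depth + 1)
  case Zipped2(l, r, true) =>
    // Recurse into both parents and keep the deeper match.
    val left = findAncestor(l, depth + 1)
    val right = findAncestor(r, depth + 1)
    if (left._2 >= right._2) left else right
  // Everything else, including Indexed(_, None), ends the search at this node.
  case _ => (node, depth)
}

// Reuses an ancestor's indices when available; otherwise runs the counting job.
def resolveStartIndices(node: Node, countJob: Node => Vector[Long]): Vector[Long] =
  findAncestor(node)._1 match {
    case Indexed(_, Some(indices)) => indices
    case ancestor => countJob(ancestor)
  }
```

   In this model, a chain of size-preserving nodes above an `Indexed` node 
with populated indices resolves to that node, so `countJob` is never called. 
An `Indexed` node holding `None` is treated like any other node and leads to 
a counting job.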
   
   ### Why are the changes needed?
   
   Computing start indices currently runs a job to count every partition's 
size. When an ancestor with identical partition sizes has already done this 
work, that job is redundant; reusing the ancestor's result avoids the extra 
pass over the data.
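   
   To illustrate why the counting job is needed in the first place, the 
following plain-Scala sketch derives start indices from per-partition sizes 
as a running sum. The helper names `startIndicesFrom` and `indexPartitions` 
are hypothetical, and in-memory sequences stand in for partitions; this is 
not the Spark implementation.

```scala
// Start index of each partition = total size of all partitions before it.
// The last partition's size does not affect any start index, so it is dropped.
def startIndicesFrom(partitionSizes: Seq[Long]): Vector[Long] =
  if (partitionSizes.isEmpty) Vector.empty
  else partitionSizes.init.scanLeft(0L)(_ + _).toVector

// Assigns each element the index: start of its partition + offset within it.
def indexPartitions[A](partitions: Seq[Seq[A]]): Seq[Seq[(A, Long)]] = {
  val starts = startIndicesFrom(partitions.map(_.size.toLong))
  partitions.zip(starts).map { case (part, start) =>
    part.zipWithIndex.map { case (x, offset) => (x, start + offset) }
  }
}
```

   With partition sizes 3, 4, and 2, the start indices are 0, 3, and 7. Any 
descendant with the same partition sizes would compute the same values, which 
is what makes reuse safe.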
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   This patch relies on existing `ZippedWithIndexRDD` test coverage; the PR is 
opened as a draft so that CI can run for full validation.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code (model: claude-opus-4-8)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


