zhengruifeng opened a new pull request, #56395:
URL: https://github.com/apache/spark/pull/56395

   ### What changes were proposed in this pull request?
   
   `zipWithIndex` builds a `ZippedWithIndexRDD`, which precomputes the start 
index of every partition by launching a small counting job over all but the 
last partition. This PR lets that step reuse the start indices an ancestor 
`ZippedWithIndexRDD` has already computed, instead of counting again.
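
As a rough illustration of the precomputation described above, the start index of each partition is the running total of the sizes of the partitions before it, which is why the last partition never needs to be counted. This is a plain-Scala sketch, not Spark's code: `StartIndicesSketch` and its `startIndices` helper are hypothetical and take the partition sizes directly instead of running a job.

```scala
object StartIndicesSketch {
  /** Start index of each partition, given the element count of every partition.
    * Hypothetical helper for illustration only. The size of the last partition
    * is never read, mirroring the "all but the last partition" counting job.
    */
  def startIndices(partitionSizes: Seq[Long]): Seq[Long] =
    if (partitionSizes.isEmpty) Seq.empty[Long]
    else partitionSizes.init.scanLeft(0L)(_ + _)
}
```

For partitions of sizes 3, 2 and 4, `StartIndicesSketch.startIndices(Seq(3L, 2L, 4L))` yields the start indices 0, 3 and 5.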
   
   When walking the lineage in `getAncestorWithSamePartitionSizes`, the walk 
now stops as soon as a `ZippedWithIndexRDD` is reached. If the resolved 
ancestor is itself a `ZippedWithIndexRDD`, its `startIndices` are reused 
directly instead of running another counting job. Start indices depend only 
on per-partition element counts, and the walk only crosses size-preserving 
operators (`map`, size-preserving `mapPartitions`, size-preserving zips, ...), 
so each partition's start index is guaranteed to match the ancestor's.
   
   As a result, a chain such as `rdd.zipWithIndex().map(f).zipWithIndex()` runs 
the counting job only once.
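
The reuse rule can be sketched with a toy lineage model. This is only an illustration under stated assumptions, not the actual implementation: `Node`, `Source`, `Mapped`, `Filtered`, `Indexed`, `JobCounter` and `resolve` are hypothetical names, partition sizes are stored directly on the nodes, and a "counting job" is just a counter increment.

```scala
import scala.annotation.tailrec

object LineageReuseSketch {
  /** Stands in for the scheduler: records how many counting jobs were "run". */
  final class JobCounter { var jobs: Int = 0 }

  /** Toy lineage node that exposes its per-partition element counts. */
  sealed trait Node { def sizes: Vector[Long] }

  final case class Source(sizes: Vector[Long]) extends Node

  /** Models a size-preserving operator such as `map`. */
  final case class Mapped(parent: Node) extends Node {
    def sizes: Vector[Long] = parent.sizes
  }

  /** Models `filter`: partition sizes may change, so they are supplied anew. */
  final case class Filtered(parent: Node, sizes: Vector[Long]) extends Node

  /** Models `zipWithIndex`: start indices are resolved eagerly on construction. */
  final class Indexed(val parent: Node, counter: JobCounter) extends Node {
    def sizes: Vector[Long] = parent.sizes

    val startIndices: Vector[Long] = resolve(parent) match {
      // An indexed ancestor with identical partition sizes: reuse its result.
      case ancestor: Indexed => ancestor.startIndices
      // Otherwise count afresh (all but the last partition).
      case _ =>
        counter.jobs += 1
        if (sizes.isEmpty) Vector.empty[Long]
        else sizes.init.scanLeft(0L)(_ + _)
    }
  }

  /** Walks up through size-preserving nodes only, stopping at the first
    * `Indexed` node or at any node that may change partition sizes.
    */
  @tailrec
  def resolve(node: Node): Node = node match {
    case Mapped(parent) => resolve(parent)
    case other          => other
  }
}
```

In this model, `new Indexed(Mapped(first), counter)` built on top of an existing `Indexed` node `first` leaves `counter.jobs` unchanged, whereas wrapping a `Filtered` node increments it, which mirrors the positive and negative cases the PR describes.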
   
   ### Why are the changes needed?
   
   The counting job submitted by `zipWithIndex` is pure overhead when the 
per-partition indices have already been computed upstream. Pipelines that 
re-`zipWithIndex` after size-preserving transforms paid for a redundant job 
every time. Reusing the ancestor's `startIndices` removes that extra job.
   
   ### Does this PR introduce any user-facing change?
   
   No. The computed indices are identical; only the redundant counting job is 
avoided.
   
   ### How was this patch tested?
   
   New unit tests in `RDDSuite` that use a `SparkListener` to count submitted 
jobs:
   - reuse through a single size-preserving `map`;
   - direct `zipWithIndex().zipWithIndex()` chaining;
   - reuse through multiple chained `map`s;
   - a negative case where a `filter` breaks the chain, confirming a fresh 
counting job is still submitted and the indices are recomputed correctly.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code (model: claude-opus-4-8)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

