zhengruifeng opened a new pull request, #56395: URL: https://github.com/apache/spark/pull/56395
### What changes were proposed in this pull request?

`zipWithIndex` builds a `ZippedWithIndexRDD`, which precomputes the start index of every partition by launching a small counting job over all but the last partition.

This PR makes that step reuse the work an ancestor `ZippedWithIndexRDD` has already done. When walking the lineage in `getAncestorWithSamePartitionSizes`, the walk now stops as soon as a `ZippedWithIndexRDD` is reached. If the resolved ancestor is itself a `ZippedWithIndexRDD`, its `startIndices` are reused directly instead of running another counting job.

The walk only crosses size-preserving operators (`map`, size-preserving `mapPartitions`, size-preserving zips, ...), so each partition's start index is guaranteed to match the ancestor's. As a result, a chain such as `rdd.zipWithIndex().map(f).zipWithIndex()` runs the counting job only once.

### Why are the changes needed?

The counting job submitted by `zipWithIndex` is pure overhead when the per-partition indices have already been computed upstream. Pipelines that re-`zipWithIndex` after size-preserving transforms paid for a redundant job every time. Reusing the ancestor's `startIndices` removes that extra job.

### Does this PR introduce any user-facing change?

No. The computed indices are identical; only the redundant counting job is avoided.

### How was this patch tested?

New unit tests in `RDDSuite` that use a `SparkListener` to count submitted jobs:

- reuse through a single size-preserving `map`;
- direct `zipWithIndex().zipWithIndex()` chaining;
- reuse through multiple chained `map`s;
- a negative case where a `filter` breaks the chain, confirming a fresh counting job is still submitted and the indices are recomputed correctly.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (model: claude-opus-4-8)
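
### Illustrative sketch of the reuse rule

The reuse rule described under "What changes were proposed" can be modeled in a few lines of plain Scala. This is a toy model only, not the code in this PR: `Node`, `Source`, `SizePreserving`, `SizeChanging`, `Zipped`, `resolveAncestor`, and `countingJobs` are hypothetical names invented for illustration, and partition sizes are given up front instead of being counted by a real Spark job. It assumes every dataset has at least one partition.

```scala
object ZipWithIndexReuseSketch {
  // Number of simulated counting jobs launched so far (hypothetical counter).
  var countingJobs = 0

  sealed trait Node { def partitionSizes: Vector[Long] }

  // A leaf dataset whose partition sizes are known to the model.
  final case class Source(partitionSizes: Vector[Long]) extends Node

  // Stand-in for map and other operators that keep every partition's size.
  final case class SizePreserving(parent: Node) extends Node {
    def partitionSizes: Vector[Long] = parent.partitionSizes
  }

  // Stand-in for filter: partition sizes may change, so the walk stops here.
  final case class SizeChanging(parent: Node, partitionSizes: Vector[Long]) extends Node

  // Stand-in for ZippedWithIndexRDD: start indices are resolved at construction.
  final class Zipped(val parent: Node) extends Node {
    def partitionSizes: Vector[Long] = parent.partitionSizes

    val startIndices: Vector[Long] = resolveAncestor(parent) match {
      case z: Zipped =>
        // An ancestor already holds the start indices: reuse them, no job.
        z.startIndices
      case other =>
        // Otherwise simulate the counting job over all but the last partition.
        countingJobs += 1
        other.partitionSizes.dropRight(1).scanLeft(0L)(_ + _)
    }
  }

  // Walk up through size-preserving operators only; stop at the first node
  // that is a Zipped, a Source, or a size-changing operator.
  @annotation.tailrec
  def resolveAncestor(node: Node): Node = node match {
    case SizePreserving(parent) => resolveAncestor(parent)
    case other                  => other
  }
}
```

In this model, `new Zipped(SizePreserving(new Zipped(Source(Vector(3L, 2L, 4L)))))` increments `countingJobs` once, for the inner `Zipped`, and both nodes end up with start indices `Vector(0, 3, 5)`. Placing a `SizeChanging` node between the two instead forces a second simulated job, mirroring the negative `filter` test case.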
