zhengruifeng opened a new pull request, #54169:
URL: https://github.com/apache/spark/pull/54169
### What changes were proposed in this pull request?
Mitigate the recomputation in `zipWithIndex`
### Why are the changes needed?
`zipWithIndex` triggers an extra job to compute the `startIndices` (the first global index of each partition).
If the parent RDD has the same data distribution, that is, the same number of
partitions and the same number of rows per partition, then the parent RDD can
be used to compute the `startIndices`, so the intermediate computation can be skipped.
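For context, the `startIndices` depend only on per-partition row counts, not on row values: they are a cumulative sum over the counts. A minimal sketch in plain Scala (illustrative only; the hypothetical `countsPerPartition` stands in for the result of the counting job Spark actually runs):

```scala
// Illustrative sketch: how per-partition start indices are derived.
// `countsPerPartition` stands in for the counting job's result; since it
// depends only on row counts, any RDD with the same distribution yields
// the same startIndices.
object StartIndicesSketch {
  def startIndices(countsPerPartition: Array[Long]): Array[Long] =
    countsPerPartition.scanLeft(0L)(_ + _).init

  def main(args: Array[String]): Unit = {
    // e.g. 4 partitions holding 3, 2, 3, 2 rows
    val counts = Array(3L, 2L, 3L, 2L)
    println(startIndices(counts).mkString(","))  // 0,3,5,8
  }
}
```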
It should benefit patterns such as:
```
rdd.map(expensive computation).zipWithIndex
df.select(expensive computation).zipWithIndex
```
Support could be extended to other operators, but this PR focuses on `RDD.map`
and `ProjectExec`.
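One way to picture the optimization is walking up the lineage past operators that preserve the number of partitions and the number of rows per partition, and running the counting job there. A hypothetical sketch in plain Scala (all names invented for illustration; this is not the actual Spark implementation):

```scala
// Hypothetical model: an operator chain where some operators, like
// map/Project, keep the number of partitions and the number of rows
// per partition unchanged.
final case class Op(name: String, preservesDistribution: Boolean, parent: Option[Op])

object CountOnParentSketch {
  // Walk up while the current op preserves the distribution: a row-count
  // job run on the returned ancestor yields the same startIndices.
  def countTarget(op: Op): Op = op.parent match {
    case Some(p) if op.preservesDistribution => countTarget(p)
    case _ => op
  }

  def main(args: Array[String]): Unit = {
    val range  = Op("range", preservesDistribution = true, None)
    val mapped = Op("expensiveMap", preservesDistribution = true, Some(range))
    println(countTarget(mapped).name)  // range
  }
}
```

A `filter`, by contrast, changes row counts, so the walk would stop there and count on the filtered output itself.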
Manually checked with:
```scala
val rdd = sc.range(0, 10, 1, 4)
val start = System.currentTimeMillis()
rdd.map(x => {Thread.sleep(10000); x + 1}).zipWithIndex().collect()
val duration = System.currentTimeMillis() - start
```
master:
```scala
val rdd: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[1] at range at
<console>:1
val start: Long = 1770351594037
val duration: Long = 60651
```
this PR:
```scala
val rdd: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[12] at range at
<console>:1
val start: Long = 1770351397114
val duration: Long = 30040
```
With this PR, the `startIndices` are computed from the original RDD
`sc.range(0, 10, 1, 4)`, so the expensive computation
`x => {Thread.sleep(10000); x + 1}` runs only once (in the result job) instead
of twice (counting job plus result job), roughly halving the wall-clock time.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI; unit tests will be added.
### Was this patch authored or co-authored using generative AI tooling?
No
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]