Github user andrewor14 commented on the pull request:
https://github.com/apache/spark/pull/1679#issuecomment-50857283
I did some benchmarking by running the following job 100 times, one
immediately after another. Each job launches many short-lived tasks, each of
which persists a single block. Because each task does almost no work, the
listener bus keeps posting events very quickly, which puts a lot of stress on
the listeners consuming them.
```
sc.parallelize(1 to 20000, 100).persist().count()
```
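For completeness, the whole benchmark is roughly the following loop in the spark-shell (nothing fancy, just the job above run back to back):
```
// Run the job 100 times back to back; each run creates a new RDD with
// 100 short tasks, each of which persists one block.
for (i <- 1 to 100) {
  sc.parallelize(1 to 20000, 100).persist().count()
}
```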
**Before:** The max queue length observed reaches 10000 at around the 65th
job and climbs to 16730 after the last job. Crossing 10000 is enough to cause
the queue to start dropping events. The average time spent in
`StorageUtils.updateRddInfo` (this method was renamed in this PR) is 176.25ms.
**After:** The max queue length never goes above 130, and the average time
spent in `StorageUtils.updateRddInfo` drops to 15.47ms, more than 10 times
faster than before.
The dark side of the story (there is always a dark side), however, is that
this improvement only shows up for RDDs that do not have too many partitions.
Although the new code iterates through only a few RDDs' blocks instead of all
RDD blocks known to mankind, it is still slow if, say, a single RDD contains
all the blocks, in which case we still have to iterate through all of them. A
rough sketch of the trade-off is below.
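To make that concrete, here is a hedged, self-contained sketch of the idea (simplified types and toy data, not the actual `StorageUtils` code): the win comes from filtering the block map by the ids of the RDDs that were updated, so the cost scales with the size of those RDDs rather than with everything the driver knows about.
```
// A hedged sketch, not the code in this patch: simplified types to show why
// filtering by RDD id helps, and why it degrades when one RDD owns all blocks.
case class BlockId(rddId: Int, splitIndex: Int)
case class BlockStatus(memSize: Long, diskSize: Long)

// All blocks known to the driver, across every cached RDD (toy data:
// 100 RDDs with 100 blocks each).
val allBlocks: Map[BlockId, BlockStatus] =
  (for (rdd <- 0 until 100; split <- 0 until 100)
    yield BlockId(rdd, split) -> BlockStatus(memSize = 1024L, diskSize = 0L)).toMap

// Old behavior (roughly): aggregate over every known block on every update.
def sizeOfAllRdds(): Long =
  allBlocks.values.map(_.memSize).sum

// New behavior (roughly): only aggregate over the blocks of the updated RDDs.
def sizeOfUpdatedRdds(updatedRddIds: Set[Int]): Long =
  allBlocks.collect {
    case (id, status) if updatedRddIds.contains(id.rddId) => status.memSize
  }.sum

// Updating one RDD now touches 100 blocks instead of 10000. But if a single
// RDD owned all 10000 blocks, the filter would keep everything and we would
// be back to a full scan.
```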