Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/1679#issuecomment-50857283
  
    I did some benchmarking by running the following job 100 times, one
    immediately after another. Each job launches many short-lived tasks, each of
    which persists a single block. Because each task does so little work, the
    listener bus keeps posting events very quickly, placing a lot of stress on
    the listeners that consume them.
    ```
    sc.parallelize(1 to 20000, 100).persist().count()
    ```
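    For concreteness, the benchmark amounts to running that job in a tight loop.
    A minimal sketch, assuming `sc` is the usual `SparkContext` (e.g. in
    spark-shell):
    ```
    // Run the same short-task job 100 times back to back. Each iteration
    // creates, persists, and counts a fresh 100-partition RDD.
    (1 to 100).foreach { _ =>
      sc.parallelize(1 to 20000, 100).persist().count()
    }
    ```
    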
    **Before:** The maximum observed queue length reaches 10000 at around the
    65th job and climbs to 16730 after the last job. Without this PR, that is
    enough to make the queue start dropping events. The average time spent in
    `StorageUtils.updateRddInfo` (this method was renamed) is 176.25ms.
    
    **After:** The maximum queue length never goes above 130, and the average
    time spent in `StorageUtils.updateRddInfo` is 15.47ms, more than 10 times
    faster than before.
    
    The dark side of the story (there is always a dark side), however, is that
    this improvement only shows up when no single RDD has too many partitions.
    Although the new code iterates through only a few RDDs' blocks instead of
    all RDD blocks known to mankind, it is still slow if, say, a single RDD
    contains all the blocks, in which case we still end up iterating through
    every RDD block.
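    
    To make the trade-off concrete, here is a minimal sketch of the idea, with
    hypothetical names rather than the actual `StorageUtils` code: the cost of
    an update is proportional to the number of blocks owned by the RDDs being
    updated, so a single RDD that owns nearly all the blocks degenerates back
    into a full scan.
    ```
    // Hypothetical model of the per-RDD update (not Spark's real API).
    case class BlockStatus(memSize: Long, diskSize: Long)

    // Aggregate storage totals for just the updated RDDs. Cost is proportional
    // to the number of blocks those RDDs own, not to all known blocks; but if
    // one RDD owns every block, this is still effectively a full scan.
    def aggregatePerRddStorage(
        updatedRddIds: Seq[Int],
        blocksByRdd: Map[Int, Seq[BlockStatus]]): Map[Int, (Long, Long)] = {
      updatedRddIds.map { rddId =>
        val blocks = blocksByRdd.getOrElse(rddId, Seq.empty)
        (rddId, (blocks.map(_.memSize).sum, blocks.map(_.diskSize).sum))
      }.toMap
    }
    ```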

