Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/1679#issuecomment-50946509
  
    After the latest changes, I did some more benchmarking, running the same
    jobs with more partitions. I ran two experiments with different numbers of
    partitions, 1000 and 10000. For each experiment, I ran the following job 10
    times, one immediately after another, and recorded the maximum event queue
    length.
    ```
    sc.parallelize(1 to 20000, numPartitions).persist().count()
    ```
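
    The setup, roughly sketched below: the queue length is not exposed through
    a public API, so `currentListenerQueueLength()` is only a stand-in for the
    instrumentation on the listener bus; the 10 back-to-back jobs are exactly
    the snippet above.
    ```
    import java.util.concurrent.atomic.{AtomicBoolean, AtomicInteger}

    // currentListenerQueueLength() is a placeholder, not a public Spark API;
    // the actual numbers came from instrumenting the listener bus directly.
    val numPartitions = 1000  // 10000 for the second experiment
    val maxQueueLength = new AtomicInteger(0)
    val done = new AtomicBoolean(false)

    // Background thread that samples the event queue length while the jobs run
    val sampler = new Thread {
      override def run(): Unit = {
        while (!done.get()) {
          val len = currentListenerQueueLength()  // placeholder
          if (len > maxQueueLength.get()) maxQueueLength.set(len)
          Thread.sleep(10)
        }
      }
    }
    sampler.start()

    // The 10 back-to-back jobs
    for (i <- 1 to 10) {
      sc.parallelize(1 to 20000, numPartitions).persist().count()
    }

    done.set(true)
    sampler.join()
    println("max event queue length = " + maxQueueLength.get())
    ```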
    
    **Before.** In the experiment with 1000-partition RDDs, the event queue
    length exceeded 10k around the 6th job and eventually reached 18019 after
    the 10th. The numbers are even higher for 10000-partition RDDs, where the
    event queue length already exceeded 10k by the 1st job and finally reached
    *193162* after the 10th. Under this (contrived) workload, even raising the
    event queue limit by 20x, as proposed in #1579, is not sufficient. These
    results were gathered on commit 78f2af582286b81e6dc9fa9d455ed2b369d933bd in
    master.
    
    **After.** The event queue length never exceeded 507 for 1000 partitions
    and 786 for 10000 partitions. In both experiments, the number of queued
    events stayed comfortably below 10k, the threshold at which we start
    dropping events.
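
    To make the "dropping" part concrete: the listener bus feeds events into a
    bounded queue, and once that queue is at capacity new events are dropped
    rather than blocking the posting thread. A simplified sketch of that idea
    (not Spark's actual `LiveListenerBus` code):
    ```
    import java.util.concurrent.LinkedBlockingQueue
    import java.util.concurrent.atomic.AtomicLong

    // Simplified sketch: events go into a queue with a fixed capacity, and
    // anything posted while the queue is full is counted and dropped.
    class BoundedEventQueue[E](capacity: Int = 10000) {
      private val queue = new LinkedBlockingQueue[E](capacity)
      private val dropped = new AtomicLong(0)

      def post(event: E): Unit = {
        // offer() returns false when the queue is at capacity
        if (!queue.offer(event)) {
          dropped.incrementAndGet()
        }
      }

      def length: Int = queue.size()
      def droppedCount: Long = dropped.get()
    }
    ```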
    
    **Baseline.** Without caching, the maximum queue length was 193 for 1000
    partitions and 423 for 10000 partitions.

