Github user andrewor14 commented on the pull request:
https://github.com/apache/spark/pull/1679#issuecomment-50946509
After the latest changes, I did some more benchmarking, running the same jobs with more partitions. I ran two experiments with different numbers of partitions: 1000 and 10000. For each experiment, I ran the following job 10 times, one immediately after another, and recorded the maximum event queue length.
```
sc.parallelize(1 to 20000, numPartitions).persist().count()
```
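For reference, here is a minimal sketch of how such a run could be scripted in the spark-shell. The `sampleQueueLength()` helper is hypothetical; the numbers reported below were read from the listener bus internals rather than through any public API.
```
// Hypothetical probe of the listener bus queue length; placeholder only,
// not a real Spark API.
def sampleQueueLength(): Int = 0

val numPartitions = 1000  // or 10000 for the second experiment
var maxQueueLength = 0

// Run the cached count 10 times back to back and track the largest
// event queue length observed across the runs.
for (i <- 1 to 10) {
  sc.parallelize(1 to 20000, numPartitions).persist().count()
  maxQueueLength = math.max(maxQueueLength, sampleQueueLength())
}

println(s"max event queue length: $maxQueueLength")
```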
**Before.** In the experiment with 1000-partition RDDs, the event queue
length exceeded 10k at around the 6th job and eventually reached 18019 after
the 10th job. These numbers are even higher for 10000-partition RDDs, where
the event queue length exceeded 10k already during the 1st job and finally reached
*193162* after the 10th. Under this (contrived) workload, even raising the
event queue limit by 20x, as proposed in #1579, is not sufficient. These
results are from commit 78f2af582286b81e6dc9fa9d455ed2b369d933bd in
master.
**After.** The event queue length never exceeded 507 for 1000 partitions
and 786 for 10000 partitions. In both experiments, the queue length stayed
comfortably below 10k, the threshold at which we start dropping events.
**Baseline.** Without caching, the maximum queue length was 193 for 1000
partitions and 423 for 10000 partitions.
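For context on the 10k threshold mentioned above, the listener bus holds pending events in a bounded queue and drops new events once the queue is full. A simplified sketch of that behavior (illustrative only, not the actual LiveListenerBus code):
```
import java.util.concurrent.LinkedBlockingQueue
import java.util.concurrent.atomic.AtomicLong

// Simplified model of a bounded event queue with drop-on-overflow semantics.
// The default capacity mirrors the 10k threshold discussed above.
class BoundedEventQueue[E](capacity: Int = 10000) {
  private val queue = new LinkedBlockingQueue[E](capacity)
  private val droppedCount = new AtomicLong(0)

  /** Enqueue an event, or drop it (and count the drop) if the queue is full. */
  def post(event: E): Boolean = {
    val added = queue.offer(event)
    if (!added) droppedCount.incrementAndGet()
    added
  }

  def size: Int = queue.size()
  def dropped: Long = droppedCount.get()
}
```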