GitHub user andrewor14 opened a pull request:
https://github.com/apache/spark/pull/1679
[SPARK-2316] Avoid O(blocks) operations in listeners
The existing code in `StorageUtils` is inefficient. Every time
we want to update an `RDDInfo`, we iterate through all blocks on all
block managers, only to discard most of them. The symptoms show up
as the many UI bugs observed in the wild. Many of these bugs are caused by
the slow consumption of events in `LiveListenerBus`, which frequently causes
the event queue to overflow and `SparkListenerEvent`s to be dropped.
This PR avoids the problem by first selecting only the blocks that belong
to the RDD being updated and then computing storage information from them.
It is also worth mentioning that this corner of the Spark code is not
well tested. The bulk of the changes in this PR are actually test cases
for the logic in `StorageUtils.scala`. These will eventually be extended to
cover the various listeners that constitute the `SparkUI`.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/andrewor14/spark fix-drop-events
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/1679.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1679
----
commit 53af15d25e19b8b63bcad035e3149e2920943561
Author: Andrew Or <[email protected]>
Date: 2014-07-30T23:20:40Z
Refactor StorageStatus + add a bunch of tests
This commit refactors StorageStatus to keep track of the set of RDD
IDs that have blocks stored in the status' block manager. This way
we don't have to linearly scan through every storage status' blocks
when a status contains no blocks for the RDD we're interested in.
This commit also adds a bunch of tests for StorageStatus and
StorageUtils methods. There were previously a few minor bugs in
StorageUtils.blockLocationsFromStorageStatus and
StorageUtils.filterStorageStatusByRDD that are now fixed and tested.
Going forward, we first need to clean up the method signatures to
reflect what they actually do. Then we can make things more
efficient, now that we've set the stage.
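As a rough illustration of the idea in this commit message, the sketch below indexes blocks by RDD ID so that checking whether a status holds any block of a given RDD is a hash lookup rather than a scan. It is not the code from this PR: `RddIndexSketch`, `StatusSketch`, `RddBlockId`, `containsRdd` and `rddBlocks` are hypothetical names, and the key set of the map plays the role of the "set of RDD IDs" described above.

```scala
import scala.collection.mutable

object RddIndexSketch {
  // Hypothetical stand-in for an RDD block identifier; not Spark's class.
  final case class RddBlockId(rddId: Int, splitIndex: Int)

  // Hypothetical stand-in for a per-block-manager storage status.
  final class StatusSketch(val executorId: String) {
    // Blocks grouped by owning RDD; the keys are the RDD IDs present here.
    private val blocksByRdd = mutable.Map.empty[Int, mutable.Set[RddBlockId]]

    def addBlock(id: RddBlockId): Unit =
      blocksByRdd.getOrElseUpdate(id.rddId, mutable.Set.empty[RddBlockId]) += id

    def removeBlock(id: RddBlockId): Unit =
      blocksByRdd.get(id.rddId).foreach { blocks =>
        blocks -= id
        // Drop the RDD ID once its last block is gone.
        if (blocks.isEmpty) blocksByRdd -= id.rddId
      }

    // Hash lookup: no scan over the blocks of unrelated RDDs.
    def containsRdd(rddId: Int): Boolean = blocksByRdd.contains(rddId)

    // Immutable snapshot of the blocks belonging to one RDD.
    def rddBlocks(rddId: Int): Set[RddBlockId] =
      blocksByRdd.get(rddId).map(_.toSet).getOrElse(Set.empty[RddBlockId])
  }
}
```

A caller that only cares about one RDD can skip any status for which `containsRdd` returns false.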
commit 41fa50df1fc520802905b2f716b2008004c7c79d
Author: Andrew Or <[email protected]>
Date: 2014-07-31T01:51:56Z
Add a legacy constructor for StorageStatus
This just makes it easier to create one with a source of blocks.
commit 7b2c4aae86c784e117809fd857c31a3a402dd958
Author: Andrew Or <[email protected]>
Date: 2014-07-31T02:18:52Z
Rewrite blockLocationsFromStorageStatus + clean up method signatures
The existing implementation of blockLocationsFromStorageStatus relies
on a groupBy, which is somewhat expensive. The new code creates a map
up front and adds the block locations to it while iterating through the
storage statuses' blocks.
This commit also cleans up StorageUtils method signatures by removing
unnecessary methods and renaming others with long-winded names.
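The difference between the two approaches described in this commit message can be sketched as follows. This is only an illustration under simplifying assumptions, not the actual StorageUtils code: `BlockLocationsSketch`, `StatusSketch`, `locationsWithGroupBy` and `locationsDirect` are hypothetical names, and blocks and locations are plain strings.

```scala
import scala.collection.mutable

object BlockLocationsSketch {
  // Hypothetical: one status per block manager, with the block names it holds.
  final case class StatusSketch(location: String, blocks: Seq[String])

  // groupBy variant: materializes every (block, location) pair, then regroups.
  def locationsWithGroupBy(statuses: Seq[StatusSketch]): Map[String, Seq[String]] =
    statuses
      .flatMap(status => status.blocks.map(block => (block, status.location)))
      .groupBy(_._1)
      .map { case (block, pairs) => block -> pairs.map(_._2) }

  // Direct variant: create the map first and append locations in one pass.
  def locationsDirect(statuses: Seq[StatusSketch]): Map[String, Seq[String]] = {
    val locations = mutable.Map.empty[String, mutable.ArrayBuffer[String]]
    for (status <- statuses; block <- status.blocks) {
      locations.getOrElseUpdate(block, mutable.ArrayBuffer.empty[String]) += status.location
    }
    locations.map { case (block, locs) => block -> locs.toList }.toMap
  }
}
```

Both variants return the same mapping; the direct one skips the intermediate sequence of pairs that groupBy has to regroup.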
commit 8e91921983fc385896d9946303debd9e77652d6c
Author: Andrew Or <[email protected]>
Date: 2014-07-31T03:21:31Z
Iterate through a filtered set of blocks when updating RDDInfo
This particular commit is the whole point of this PR. In the existing
code we unconditionally iterate through all blocks in all block managers
whenever we want to update an RDDInfo. Now, we select only the
blocks of interest to us in advance, so we don't end up constructing
a huge map and doing a groupBy on it.
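A minimal sketch of this "select first, then aggregate" approach is shown below. It is an assumption-laden illustration rather than the code in this commit: `RddInfoUpdateSketch`, `BlockSketch`, `StatusSketch`, `RddInfoSketch` and `updateRddInfo` are hypothetical names, and it assumes one cached block per partition.

```scala
object RddInfoUpdateSketch {
  // Hypothetical stand-ins; not Spark's actual classes.
  final case class BlockSketch(rddId: Int, memSize: Long, diskSize: Long)
  final case class StatusSketch(blocks: Seq[BlockSketch])
  final case class RddInfoSketch(
      id: Int,
      numCachedPartitions: Int = 0,
      memSize: Long = 0L,
      diskSize: Long = 0L)

  // Keep only the blocks of the RDD being updated, then aggregate those.
  // No map keyed by every RDD is built and no groupBy is performed.
  def updateRddInfo(info: RddInfoSketch, statuses: Seq[StatusSketch]): RddInfoSketch = {
    val relevant = statuses.flatMap { status =>
      status.blocks.filter(block => block.rddId == info.id)
    }
    info.copy(
      numCachedPartitions = relevant.size, // assumes one block per partition
      memSize = relevant.map(_.memSize).sum,
      diskSize = relevant.map(_.diskSize).sum)
  }
}
```

The filter here is still a scan within each status; a status that indexes its blocks by RDD ID could replace that scan with a lookup.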
----