[GitHub] spark pull request: [SPARK-2316] Avoid O(blocks) operations in lis...

andrewor14 Wed, 30 Jul 2014 20:36:28 -0700

GitHub user andrewor14 opened a pull request:

    https://github.com/apache/spark/pull/1679


    [SPARK-2316] Avoid O(blocks) operations in listeners

    The existing code in `StorageUtils` is inefficient. Every time we want to
    update an `RDDInfo`, we iterate through all blocks on all block managers,
    only to discard most of them. The symptoms show up as the many UI bugs
    observed in the wild. Many of these bugs are caused by the slow consumption
    of events in `LiveListenerBus`, which frequently leads to the event queue
    overflowing and `SparkListenerEvent`s being dropped. The changes in this PR
    avoid this by first selecting only the blocks relevant to the RDD being
    updated, then computing storage information from those blocks alone.
    
    It is worth mentioning that this corner of the Spark code is also not well
    tested. The bulk of the changes in this PR are test cases for the logic in
    `StorageUtils.scala`. These will eventually be extended to cover the
    listeners that constitute the `SparkUI`.
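
To illustrate how a slow consumer leads to dropped events, here is a minimal sketch of a bounded queue whose producer never blocks. It is not the `LiveListenerBus` implementation; `BoundedQueueSketch`, `postAll`, and the use of plain strings as events are assumptions made for this example. Only `java.util.concurrent.LinkedBlockingQueue` and its `offer` method are real APIs.

```scala
import java.util.concurrent.LinkedBlockingQueue

// Hypothetical illustration; not Spark's LiveListenerBus.
object BoundedQueueSketch {
  // Posts every event without blocking and returns (queued, dropped).
  // Nothing consumes the queue here, which models a listener that is too slow.
  def postAll(capacity: Int, events: Seq[String]): (Int, Int) = {
    val queue = new LinkedBlockingQueue[String](capacity)
    var dropped = 0
    events.foreach { event =>
      // offer returns false instead of blocking when the queue is full
      if (!queue.offer(event)) dropped += 1
    }
    (queue.size, dropped)
  }
}
```

With a capacity of 2 and three events posted, the third event is dropped because nothing drains the queue in time.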

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/andrewor14/spark fix-drop-events

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1679.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1679
    
----
commit 53af15d25e19b8b63bcad035e3149e2920943561
Author: Andrew Or <[email protected]>
Date:   2014-07-30T23:20:40Z

    Refactor StorageStatus + add a bunch of tests
    
    This commit refactors storage status to keep around a set of RDD
    IDs which have blocks stored in the status' block manager. This way
    we don't have to linearly scan through every single storage status'
    blocks if it doesn't even contain blocks for the RDD we're
    interested in.
    
    This commit also adds a bunch of tests for StorageStatus and
    StorageUtils methods. There were previously a few minor bugs in
    StorageUtils.blockLocationsFromStorageStatus and
    StorageUtils.filterStorageStatusByRDD that are now fixed and tested.
    
    Going forward, we need to first clean up the method signatures to
    reflect what they actually do. Then we will make things more
    efficient now that we've set the stage.
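
The idea of tracking RDD IDs alongside the blocks can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in this PR: `SimpleStorageStatus`, `OtherBlockId`, and the method names are hypothetical, and `BlockId`, `RDDBlockId`, and `BlockStatus` are reduced stand-ins defined locally rather than Spark's classes.

```scala
import scala.collection.mutable

// Hypothetical, simplified stand-ins; not Spark's actual classes.
sealed trait BlockId
final case class RDDBlockId(rddId: Int, splitIndex: Int) extends BlockId
final case class OtherBlockId(name: String) extends BlockId
final case class BlockStatus(memSize: Long, diskSize: Long)

class SimpleStorageStatus(val blockManagerId: String) {
  private val blocks = mutable.Map.empty[BlockId, BlockStatus]
  // IDs of the RDDs that have at least one block stored in this block manager
  private val rddIds = mutable.Set.empty[Int]

  def addBlock(id: BlockId, status: BlockStatus): Unit = {
    blocks(id) = status
    id match {
      case RDDBlockId(rddId, _) => rddIds += rddId
      case _ => // non-RDD blocks do not affect the RDD ID set
    }
  }

  // Constant-time check that lets callers skip this status entirely
  def containsRDD(rddId: Int): Boolean = rddIds.contains(rddId)

  // Scans the blocks only when this status is known to hold the RDD
  def rddBlocks(rddId: Int): Map[BlockId, BlockStatus] =
    if (!containsRDD(rddId)) Map.empty
    else blocks.collect {
      case (id @ RDDBlockId(r, _), status) if r == rddId => (id: BlockId) -> status
    }.toMap
}
```

A caller looking for one RDD's blocks can test `containsRDD` on each status first and scan only the statuses that return true.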

commit 41fa50df1fc520802905b2f716b2008004c7c79d
Author: Andrew Or <[email protected]>
Date:   2014-07-31T01:51:56Z

    Add a legacy constructor for StorageStatus
    
    This just makes it easier to create one with a source of blocks.

commit 7b2c4aae86c784e117809fd857c31a3a402dd958
Author: Andrew Or <[email protected]>
Date:   2014-07-31T02:18:52Z

    Rewrite blockLocationsFromStorageStatus + clean up method signatures
    
    The existing implementation of blockLocationsFromStorageStatus relies
    on a groupBy, which is somewhat expensive. The new code creates a map
    from the get-go and adds the block locations by iterating through the
    storage statuses' blocks.
    
    This commit also cleans up StorageUtils method signatures by removing
    unnecessary methods and renaming others with long-winded names.
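
The difference between the two approaches can be sketched as below. This is an illustration only: `BlockLocationsSketch`, `viaGroupBy`, and `direct` are hypothetical names, block IDs and hosts are plain strings, and the `groupBy` version is a guess at the general shape of such code rather than the implementation that was replaced.

```scala
import scala.collection.mutable

// Hypothetical illustration; input is (host, block IDs stored on that host).
object BlockLocationsSketch {
  // groupBy-based shape: materializes every (blockId, host) pair, then groups them
  def viaGroupBy(statuses: Seq[(String, Seq[String])]): Map[String, List[String]] =
    statuses
      .flatMap { case (host, blockIds) => blockIds.map(blockId => (blockId, host)) }
      .groupBy(_._1)
      .map { case (blockId, pairs) => blockId -> pairs.map(_._2).toList }

  // Direct shape: creates the map up front and fills it in a single pass
  def direct(statuses: Seq[(String, Seq[String])]): Map[String, List[String]] = {
    val locations = mutable.HashMap.empty[String, mutable.ArrayBuffer[String]]
    for ((host, blockIds) <- statuses; blockId <- blockIds) {
      locations.getOrElseUpdate(blockId, mutable.ArrayBuffer.empty[String]) += host
    }
    locations.map { case (blockId, hosts) => blockId -> hosts.toList }.toMap
  }
}
```

Both functions return the same mapping; the direct version skips the intermediate sequence of pairs that `groupBy` needs.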

commit 8e91921983fc385896d9946303debd9e77652d6c
Author: Andrew Or <[email protected]>
Date:   2014-07-31T03:21:31Z

    Iterate through a filtered set of blocks when updating RDDInfo
    
    This particular commit is the whole point of this PR. In the existing
    code we unconditionally iterate through all blocks in all block managers
    whenever we want to update an RDDInfo. Now, we first select only the
    blocks of interest to us, so we don't end up constructing
    a huge map and doing a groupBy on it.
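
A minimal sketch of computing one RDD's storage summary from a pre-selected set of blocks follows. The types `Block` and `RddStorageSummary` and the function `summarize` are hypothetical and invented for this example; they are not the `RDDInfo` or `StorageUtils` code touched by this commit.

```scala
// Hypothetical, simplified types for illustration; not Spark's actual classes.
final case class Block(rddId: Option[Int], memSize: Long, diskSize: Long)
final case class RddStorageSummary(rddId: Int, numCachedBlocks: Int, memSize: Long, diskSize: Long)

object RddInfoSketch {
  // blocksPerManager holds one collection of blocks per block manager.
  def summarize(rddId: Int, blocksPerManager: Seq[Seq[Block]]): RddStorageSummary = {
    // Select only this RDD's blocks up front; no intermediate map, no groupBy
    val relevant = blocksPerManager.iterator
      .flatMap(_.iterator)
      .filter(_.rddId.contains(rddId))

    // Accumulate the count and sizes in a single pass over the selected blocks
    relevant.foldLeft(RddStorageSummary(rddId, 0, 0L, 0L)) { (acc, block) =>
      acc.copy(
        numCachedBlocks = acc.numCachedBlocks + 1,
        memSize = acc.memSize + block.memSize,
        diskSize = acc.diskSize + block.diskSize)
    }
  }
}
```

The sketch still visits each block once; what it avoids is building a map keyed by every RDD when only one RDD is being updated.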

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
