GitHub user a-roberts opened a pull request:
https://github.com/apache/spark/pull/15735
[SPARK-18223] [CORE] Optimise PartitionedAppendOnlyMap implementation
## What changes were proposed in this pull request?
This class and the PartitionedPairBuffer class are both core Spark data
structures that allow us to spill data to disk.
From the comment in ExternalSorter before instantiating said data
structures:
// Data structures to store in-memory objects before we spill. Depending on whether we have an
// Aggregator set, we either put objects into an AppendOnlyMap where we combine them, or we
// store them in an array buffer.
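To illustrate the comment above, here is a minimal, self-contained sketch of the two insert paths. It uses a plain HashMap and ArrayBuffer as stand-ins for PartitionedAppendOnlyMap and PartitionedPairBuffer (the real collections additionally track their estimated size so we know when to spill); the function and parameter names here are hypothetical, not the ExternalSorter source.

```scala
import scala.collection.mutable

// Hypothetical sketch: both paths key records by (partitionId, key) so a later
// sort can order by partition first.
def insertAllSketch[K, V, C](
    records: Iterator[(K, V)],
    getPartition: K => Int,
    // (createCombiner, mergeValue) plays the role of Spark's Aggregator
    aggregator: Option[(V => C, (C, V) => C)])
  : Either[mutable.HashMap[(Int, K), C], mutable.ArrayBuffer[(Int, K, V)]] = {
  aggregator match {
    case Some((createCombiner, mergeValue)) =>
      // Aggregator set: combine values per key, like PartitionedAppendOnlyMap
      val map = mutable.HashMap.empty[(Int, K), C]
      records.foreach { case (k, v) =>
        val key = (getPartition(k), k)
        map.update(key, map.get(key).map(mergeValue(_, v)).getOrElse(createCombiner(v)))
      }
      Left(map)
    case None =>
      // No aggregator: just append records, like PartitionedPairBuffer
      val buf = mutable.ArrayBuffer.empty[(Int, K, V)]
      records.foreach { case (k, v) => buf += ((getPartition(k), k, v)) }
      Right(buf)
  }
}
```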
All of our data within RDDs has a partition ID, and the ordering operations
order by partition before any other criterion. Both data structures share a
partitionKeyComparator from WritablePartitionedPairCollection.
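That partition-first ordering amounts to a comparator along the following lines (a sketch of the idea rather than the exact WritablePartitionedPairCollection source):

```scala
import java.util.Comparator

// Order (partitionId, key) pairs by partition first, then by the supplied key
// comparator, mirroring the partitionKeyComparator idea.
def partitionKeyComparator[K](keyComparator: Comparator[K]): Comparator[(Int, K)] =
  new Comparator[(Int, K)] {
    override def compare(a: (Int, K), b: (Int, K)): Int = {
      val partitionDiff = java.lang.Integer.compare(a._1, b._1)
      if (partitionDiff != 0) partitionDiff else keyComparator.compare(a._2, b._2)
    }
  }
```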
While this change adds more code, it is the iterator wrapping we remove that
has the negative performance impact: by avoiding that wrapping we give the JIT
inliner a simpler call chain to work through. With the wrapping removed we've
observed a 3% PageRank performance increase on HiBench large, for both IBM's
SDK for Java and OpenJDK 8, as a result of the inliner being better able to
figure out what's going on. This observation was made in combination with an
optimised PartitionedPairBuffer implementation that I'll also contribute.
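To illustrate the kind of wrapping in question, here is a simplified sketch (hypothetical names, not the actual patch): the wrapped form pushes every record through an extra Iterator adapter, while the direct form iterates the sorted data once, leaving one less layer of virtual hasNext/next calls for the JIT to inline through.

```scala
// Hypothetical stand-in for whatever consumes the sorted records.
trait RecordWriterLike {
  def write(partition: Int, key: Any, value: Any): Unit
}

// Wrapped form: an extra generic adapter sits between the sorted iterator and
// the write loop, adding another virtual call on the hot path.
def writeAllWrapped(sorted: Iterator[((Int, Any), Any)], writer: RecordWriterLike): Unit = {
  val wrapped = sorted.map { case ((partition, key), value) => (partition, key, value) }
  wrapped.foreach { case (partition, key, value) => writer.write(partition, key, value) }
}

// Direct form: iterate the underlying data once with no adapter in between,
// giving the inliner a shallower call chain.
def writeAllDirect(sorted: Iterator[((Int, Any), Any)], writer: RecordWriterLike): Unit = {
  while (sorted.hasNext) {
    val ((partition, key), value) = sorted.next()
    writer.write(partition, key, value)
  }
}
```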
## How was this patch tested?
Existing unit tests and HiBench large, PageRank benchmark specifically.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/a-roberts/spark patch-10
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/15735.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #15735
----
commit 7d2e53afdddd6d9c903f89c2457ca4b256693849
Author: Adam Roberts <[email protected]>
Date: 2016-11-02T11:30:51Z
[SPARK-18223] [CORE] Optimise PartitionedAppendOnlyMap implementation
----