GitHub user sethah opened a pull request:

    https://github.com/apache/spark/pull/17078

    [SPARK-19746][ML] Faster indexing for logistic aggregator

    ## What changes were proposed in this pull request?
    
    JIRA: [SPARK-19746](https://issues.apache.org/jira/browse/SPARK-19746)
    
    The following code is inefficient:
    
    ````scala
        val localCoefficients: Vector = bcCoefficients.value
    
        features.foreachActive { (index, value) =>
          val stdValue = value / localFeaturesStd(index)
          var j = 0
          while (j < numClasses) {
            margins(j) += localCoefficients(index * numClasses + j) * stdValue
            j += 1
          }
        }
    ````
    
    `localCoefficients(index * numClasses + j)` calls `Vector.apply`, which 
creates a new Breeze vector and indexes into it. Even if creating the object 
is cheap, doing so on every iteration generates a lot of extra garbage, which 
may lead to longer GC pauses. This is a hot inner loop, so we should optimize 
it wherever possible.
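
A minimal sketch of the idea, not the actual diff in this pull request: take a raw `Array[Double]` once, outside the loop, and index it directly so no wrapper object is allocated per access. To keep the block runnable without Spark, plain arrays stand in for the `Vector` types; `MarginSketch`, `computeMargins`, and all parameter names are hypothetical.

```scala
// Sketch only: plain arrays stand in for Spark's Vector types (an assumption,
// so that the example runs without a Spark dependency).
object MarginSketch {

  /** Accumulates per-class margins for one sparse feature vector.
    *
    * `coefficients` is a flat array laid out as `index * numClasses + j`,
    * matching the indexing in the snippet quoted above.
    */
  def computeMargins(
      featureIndices: Array[Int],     // indices of the active features
      featureValues: Array[Double],   // values of the active features
      featuresStd: Array[Double],     // per-feature standard deviations
      coefficients: Array[Double],    // raw coefficient array, obtained once
      numClasses: Int): Array[Double] = {
    val margins = new Array[Double](numClasses)
    var k = 0
    while (k < featureIndices.length) {
      val index = featureIndices(k)
      val stdValue = featureValues(k) / featuresStd(index)
      val offset = index * numClasses
      var j = 0
      while (j < numClasses) {
        // Direct array read: no intermediate vector object is created here.
        margins(j) += coefficients(offset + j) * stdValue
        j += 1
      }
      k += 1
    }
    margins
  }
}
```

How the raw array is obtained from the broadcast coefficients depends on the vector type and is left out of this sketch.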
    
    ## How was this patch tested?
    
    I don't think there's a great way to test this patch. It's purely 
performance related, so the existing unit tests should guarantee that we 
haven't made any unwanted changes. Empirically, I observed speedups of 10-40% 
in short local tests. I suspect the biggest differences will show up with 
large data/coefficient sizes, where the job has to pause for GC more often. I 
welcome other ideas for testing.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sethah/spark logistic_agg_indexing

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17078.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17078
    
----
commit 3bea389f6780e1fd0385fbe26954fa4f59b69e37
Author: sethah <seth.hendrickso...@gmail.com>
Date:   2017-02-27T05:22:09Z

    better indexing for logistic agg

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
