GitHub user srowen opened a pull request:

    https://github.com/apache/spark/pull/16328

    [SPARK-18808][ML][MLLIB] ml.KMeansModel.transform is very inefficient

    ## What changes were proposed in this pull request?
    
    mllib.KMeansModel.clusterCentersWithNorm is a method that ends up being 
called every time `predict` is called on a single vector, which is bad news for 
how the ml.KMeansModel Transformer works, since it necessarily transforms one 
vector at a time.
    
    This change makes the model store the cluster centers together with their 
norms upfront. The extra memory for each norm should be small compared to the 
vectors themselves. This avoids the repeated norm computation on this and other 
code paths.
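
As a rough illustration of the idea (not the Spark code itself), here is a minimal Python sketch. The names `RecomputingModel`, `CachingModel`, `closest` and the `stats` counter are hypothetical and exist only for this example. It assumes the distance is computed from precomputed norms via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, so the only difference between the two classes is when the center norms are computed.

```python
import math

# Hypothetical counter, used only to show how often center norms are computed.
stats = {"center_norms": 0}


def center_norm(vector):
    """L2 norm of a cluster center; bumps the illustrative counter."""
    stats["center_norms"] += 1
    return math.sqrt(sum(x * x for x in vector))


def closest(centers_with_norm, point):
    """Index of the nearest center, using precomputed center norms."""
    point_norm_sq = sum(x * x for x in point)
    best_index, best_dist = -1, float("inf")
    for i, (center, norm) in enumerate(centers_with_norm):
        dot = sum(a * b for a, b in zip(center, point))
        # ||c - p||^2 = ||c||^2 + ||p||^2 - 2 * (c . p)
        dist = norm * norm + point_norm_sq - 2.0 * dot
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index


class RecomputingModel:
    """Rebuilds the (center, norm) pairs on every single-vector predict."""

    def __init__(self, centers):
        self.centers = centers

    def centers_with_norm(self):
        return [(c, center_norm(c)) for c in self.centers]

    def predict(self, point):
        return closest(self.centers_with_norm(), point)


class CachingModel:
    """Computes the (center, norm) pairs once, at construction time."""

    def __init__(self, centers):
        self.centers = centers
        self._centers_with_norm = [(c, center_norm(c)) for c in centers]

    def predict(self, point):
        return closest(self._centers_with_norm, point)
```

With k centers and n single-vector predictions, the first class computes k * n center norms while the second computes k, at the cost of one extra number stored per center.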
    
    ## How was this patch tested?
    
    Existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/srowen/spark SPARK-18808

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16328.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16328
    
----
commit cddf553511e5a9684020b2e1a8ad04d3858db9ec
Author: Sean Owen <[email protected]>
Date:   2016-12-18T09:55:03Z

    Store vector norms upfront in KMeansModel to avoid computing every time in 
ml.KmeansModel.predict, etc

----

