[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

mengxr Thu, 08 Jan 2015 15:55:31 -0800

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/2634#issuecomment-69272196
  
    @derrickburns I like the improvements implemented in this PR. But as 
@srowen mentioned, we have to resolve conflicts with the master branch before 
we can merge any PR. I compared the performance of this PR with master on 
minist-digits (60000x784, sparse, 10 clusters) locally and found the master 
runs 2-3x faster. I guess this is majorly caused by two changes.
    
    1. We replaced breeze operations by our own implementation. The latter is 
about 2-3x faster.
    1. Running k-means++ distributively has noticeable overhead with small k 
and feature dimension.
    
    I think it is still feasible to include features through separate PRs:
    
    1. remember previously computed best distances in k-means++ initialization
    1. allow fixing the random seed (addressed in #3610)
    1. variable number of clusters. We should discuss whether we want to have 
less than k clusters or split the biggest one if there are more than k points. 
    1. parallelize k-means++. I think whether we should replace local k-means++ 
or make it configurable requires some discussion and performance comparison.
    1. support Bregman divergences
    
    Putting all of them together would certainly delay the review process and 
require resolving conflicts. I may have some time to prepare PRs for some of 
the features here, if you don't mind.
    
    For Bregman divergences, I'm thinking we can alter the formulation to 
support sparse vectors:
    
    ~~~
    d(x, y) = f(x)  - f(y) - <x - y, g(y)> = f(x) - (f(y) - <y, g(y)>) - <x, 
g(y)>
    ~~~
    
    where `f(x)`, `g(y)`, and `f(y) - <y, g(y)>` could be pre-computed and 
cached, and `<x, g(y)>` can take advantage of sparse `x`. But I'm not sure 
whether this formulation is really useful on any Bregman divergence rather than 
the squared distance and the Mahalanobis distance. For KL-divergence and 
generalized I-divergence, the domain is R^d_+ and hence the points cannot be 
sparse.
    
    Besides those comments, I'm going to make some minor comments inline.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

Reply via email to