GitHub user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/2634#issuecomment-70345238
I see the problem with sparse vectors and the KL divergence.
I implemented a smoothing operation to approximate the KL divergence in that case.
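As a sketch of what such a smoothing step could look like, assuming simple additive smoothing that mixes a small uniform mass into each vector before evaluating the divergence (the object, method names, and the constant below are illustrative, not the code in this PR):

```scala
// Illustrative sketch only: additive smoothing so that KL(p || q) stays finite
// even when a sparse vector has implicit zero entries. The smoothing constant
// eps is a hypothetical choice, not a value taken from this PR.
object KLSmoothing {

  // Mix a small uniform mass into the vector, then renormalize to a probability vector.
  def smooth(p: Array[Double], eps: Double = 1e-9): Array[Double] = {
    val shifted = p.map(_ + eps)
    val norm = shifted.sum
    shifted.map(_ / norm)
  }

  // KL(p || q) on the smoothed vectors, so log(0) and division by zero never occur.
  def kl(p: Array[Double], q: Array[Double]): Double = {
    val ps = smooth(p)
    val qs = smooth(q)
    ps.zip(qs).map { case (pi, qi) => pi * math.log(pi / qi) }.sum
  }
}
```

With this kind of smoothing, a sparse vector whose missing entries are implicit zeros still yields a finite KL value, at the cost of a small bias controlled by eps.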
> On Jan 8, 2015, at 3:55 PM, Xiangrui Meng <[email protected]> wrote:
>
> @derrickburns I like the improvements implemented in this PR. But as
> @srowen mentioned, we have to resolve conflicts with the master branch before
> we can merge any PR. I compared the performance of this PR with master on
> mnist-digits (60000x784, sparse, 10 clusters) locally and found that master
> runs 2-3x faster. I guess this is mostly caused by two changes:
>
> - We replaced breeze operations with our own implementation. The latter is
>   about 2-3x faster.
> - Running k-means++ in a distributed fashion has noticeable overhead with
>   small k and feature dimension.
> I think it is still feasible to include features through separate PRs:
>
> - remember previously computed best distances in k-means++ initialization
> - allow fixing the random seed (addressed in #3610)
> - variable number of clusters: we should discuss whether we want to have
>   fewer than k clusters or split the biggest one if there are more than k points
> - parallelize k-means++: whether we should replace local k-means++
>   or make it configurable requires some discussion and performance comparison
> - support Bregman divergences
> Putting all of them together would certainly delay the review process and
> require resolving conflicts. I may have some time to prepare PRs for some of
> the features here, if you don't mind.
>
> For Bregman divergences, I'm thinking we can alter the formulation to
> support sparse vectors:
>
> d(x, y) = f(x) - f(y) - <x - y, g(y)> = f(x) - (f(y) - <y, g(y)>) - <x, g(y)>
>
> where f(x), g(y), and f(y) - <y, g(y)> could be pre-computed and cached, and
> <x, g(y)> can take advantage of sparse x. But I'm not sure whether this
> formulation is really useful for any Bregman divergence other than the squared
> distance and the Mahalanobis distance. For the KL divergence and generalized
> I-divergence, the domain is R^d_+, and hence the points cannot be sparse.
>
> Besides those comments, I'm going to make some minor comments inline.
>
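To make the decomposition quoted above concrete, here is a minimal sketch of caching g(y) and the constant f(y) - <y, g(y)> per cluster center, so that evaluating d(x, y) against a sparse point only requires the sparse inner product <x, g(y)>. All names below are hypothetical; this is not the implementation in the PR:

```scala
// Sketch of d(x, y) = f(x) - (f(y) - <y, g(y)>) - <x, g(y)> with cached center terms.
// Illustrative only; names and signatures are not taken from the PR.

// Per-center cache: the dense gradient g(y) and the scalar offset f(y) - <y, g(y)>.
case class CachedCenter(gy: Array[Double], offset: Double)

class BregmanSketch(f: Array[Double] => Double, grad: Array[Double] => Array[Double]) {

  // Precompute the gradient and the constant term for a (dense) center y.
  def cache(y: Array[Double]): CachedCenter = {
    val gy = grad(y)
    val offset = f(y) - y.zip(gy).map { case (yi, gi) => yi * gi }.sum
    CachedCenter(gy, offset)
  }

  // Divergence from a sparse point x, given as parallel index/value arrays plus
  // its precomputed f(x), to a cached center. Only the nonzero entries of x are touched.
  def divergence(indices: Array[Int], values: Array[Double],
                 fx: Double, center: CachedCenter): Double = {
    var dot = 0.0
    var i = 0
    while (i < indices.length) {
      dot += values(i) * center.gy(indices(i))
      i += 1
    }
    fx - center.offset - dot
  }
}
```

As a sanity check of the algebra: for the squared Euclidean distance with f(x) = ||x||^2 / 2 and g(y) = y, the cached offset is -||y||^2 / 2, and f(x) - offset - <x, y> reduces to ||x - y||^2 / 2 as expected.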