Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/2634#issuecomment-69404412
Thanks for the information on the speedup that you obtained by eliminating
Breeze. I was unaware that its performance was so poor. To what do you
attribute the poor performance of Breeze?
One can certainly split out 1-4 AFTER doing 5. My point was that doing 5
(supporting Bregman divergences) requires touching practically every method
in the clustering package.
3. Supporting a variable number of clusters requires a significant rewrite
as well, since the assumption that there are K clusters is made repeatedly
in the code. It seems that adding a way to split clusters introduces a
whole other dimension of flexibility -- which I am not against -- that
would complicate the interface. For example, what heuristic would you use
to identify a cluster to split, and how would you split it? It seems to me
that once you go down that path, you are headed toward a divisive
clustering algorithm.
4. As for the overhead of running k-means++ distributively: would anyone
use Spark for small k and feature dimension?
5. Bregman divergences
As it turns out, I now have a need to perform clustering on sparse data, so
I am making a pass over my private implementation to 1) eliminate the use
of Breeze and 2) use the Spark Vector classes. When I am done, I can
include these modifications in an updated pull request.
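For reference, building points with the Spark Vector classes instead of
Breeze looks roughly like this (a minimal sketch using the public
org.apache.spark.mllib.linalg factory methods; the sizes and values are
made up for illustration):

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// A sparse point: logical size 6, nonzeros at indices 0, 3, and 5.
val sparsePoint: Vector = Vectors.sparse(6, Array(0, 3, 5), Array(2.0, 7.0, 1.0))

// A dense point, e.g. a cluster center, with the same logical size.
val densePoint: Vector = Vectors.dense(2.0, 0.0, 0.0, 7.0, 0.0, 1.0)

// Both share the Vector interface, so distance code can pattern-match
// on SparseVector / DenseVector to choose a fast path for each.
```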
I do not understand your conclusion that for KL-divergence and generalized
I-divergence the points cannot be sparse. The domain need not be R^d_+;
rather, it can be R^infinity_+.
For example, in the problem that I am currently working on, I am creating
sparse vectors that represent the frequency of occurrence of the contexts
for given n-grams. The space of contexts is extremely large; however, I am
only interested in the M most frequent contexts per n-gram, which I
identify using a sketch. This gives me, for each n-gram, an M-length
SparseVector of non-negative frequency values that represents the
distribution of its more frequently occurring contexts. I do not see a
problem using this SparseVector with the KL-divergence metric. Am I missing
something?
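To make this concrete, here is a minimal sketch of the generalized
I-divergence computed through the decomposition you propose below,
d(x, y) = f(x) - (f(y) - <y, g(y)>) - <x, g(y)>, with
f(x) = sum_i x_i log x_i and g(y)_i = 1 + log y_i (names like IDivergence
and CenterStats are mine, purely illustrative). Only the nonzero entries of
the sparse point are touched; it is the centers, not the points, that must
be strictly positive, which smoothing guarantees:

```scala
import org.apache.spark.mllib.linalg.{DenseVector, SparseVector}

object IDivergence {
  // Per-center terms, cached once per center per Lloyd's iteration.
  // With f(y) = sum_i y_i log y_i and g(y)_i = 1 + log y_i, the
  // constant f(y) - <y, g(y)> simplifies to -sum_i y_i.
  case class CenterStats(logY: Array[Double], constant: Double)

  def cache(center: DenseVector): CenterStats = {
    require(center.values.forall(_ > 0.0), "centers must be strictly positive")
    CenterStats(center.values.map(math.log), -center.values.sum)
  }

  // d(x, y) = sum_i x_i log(x_i / y_i) - sum_i x_i + sum_i y_i.
  // Entries with x_i = 0 contribute nothing to f(x) or <x, g(y)>
  // (using the convention 0 log 0 = 0), so only the nonzeros of the
  // sparse point are visited.
  def divergence(x: SparseVector, c: CenterStats): Double = {
    var fx = 0.0    // f(x), accumulated over the nonzeros of x
    var xDotG = 0.0 // <x, g(y)>, accumulated over the nonzeros of x
    var i = 0
    while (i < x.indices.length) {
      val xi = x.values(i)
      fx += xi * math.log(xi)
      xDotG += xi * (1.0 + c.logY(x.indices(i)))
      i += 1
    }
    fx - c.constant - xDotG
  }
}
```

In my n-gram problem, x would be the M-length SparseVector of context
frequencies and the centers would be smoothed to stay strictly positive, so
the points themselves never need to be densified.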
On Thu, Jan 8, 2015 at 3:55 PM, Xiangrui Meng <[email protected]>
wrote:
> @derrickburns <https://github.com/derrickburns> I like the improvements
> implemented in this PR. But as @srowen <https://github.com/srowen>
> mentioned, we have to resolve conflicts with the master branch before we
> can merge any PR. I compared the performance of this PR with master on
> mnist-digits (60000x784, sparse, 10 clusters) locally and found that
> master runs 2-3x faster. I guess this is mainly caused by two changes.
>
> 1. We replaced breeze operations by our own implementation. The latter
> is about 2-3x faster.
> 2. Running k-means++ distributively has noticeable overhead with small
> k and feature dimension.
>
> I think it is still feasible to include features through separate PRs:
>
> 1. remember previously computed best distances in k-means++
> initialization
> 2. allow fixing the random seed (addressed in #3610
> <https://github.com/apache/spark/pull/3610>)
> 3. variable number of clusters. We should discuss whether we want to
> have fewer than k clusters or split the biggest one if there are more
> than k points.
> 4. parallelize k-means++. I think whether we should replace local
> k-means++ or make it configurable requires some discussion and
> performance comparison.
> 5. support Bregman divergences
>
> Putting all of them together would certainly delay the review process and
> require resolving conflicts. I may have some time to prepare PRs for some
> of the features here, if you don't mind.
>
> For Bregman divergences, I'm thinking we can alter the formulation to
> support sparse vectors:
>
> d(x, y) = f(x) - f(y) - <x - y, g(y)> = f(x) - (f(y) - <y, g(y)>) - <x, g(y)>
>
> where f(x), g(y), and f(y) - <y, g(y)> could be pre-computed and cached,
> and <x, g(y)> can take advantage of sparse x. But I'm not sure whether
> this formulation is really useful for any Bregman divergence other than
> the squared distance and the Mahalanobis distance. For KL-divergence and
> generalized I-divergence, the domain is R^d_+ and hence the points cannot
> be sparse.
>
> Besides those comments, I'm going to make some minor comments inline.
>
> Reply to this email directly or view it on GitHub
> <https://github.com/apache/spark/pull/2634#issuecomment-69272196>.
>