[
https://issues.apache.org/jira/browse/SPARK-4039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14336177#comment-14336177
]
Derrick Burns edited comment on SPARK-4039 at 2/25/15 8:05 AM:
---------------------------------------------------------------
Supporting sparse data in the clusterer is a bad idea because centroid
calculation on sparse data can be very inefficient.
But so is converting sparse data directly to a dense representation for any
data of significant dimension.
So, what do you do?
The HashingTF transformer is one approach to reducing the dimensionality of the
data, but it is not a good one: hash collisions can lead to dramatic
overestimates of feature values.
Instead, random indexing should be used.
Random Indexing (http://en.wikipedia.org/wiki/Random_indexing), per the
Johnson-Lindenstrauss lemma, guarantees (with high probability) that the
embedding into a lower-dimensional space approximately preserves pairwise
distances.
See https://github.com/derrickburns/generalized-kmeans-clustering for an
implementation.
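To make the suggestion concrete, here is a minimal sketch of a JL-style random
projection applied to sparse MLlib vectors before running the stock KMeans. It
is not the generalized-kmeans-clustering API; the target dimension (256), the
Rademacher index vectors, and the helper names (indexVector, project, cluster)
are illustrative assumptions.
{code:scala}
import scala.util.Random
import scala.util.hashing.MurmurHash3

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.{SparseVector, Vector, Vectors}

object RandomIndexingSketch {

  // Target dimension of the embedding. Assumption for illustration: the JL
  // lemma ties the dimension needed to the number of points and the
  // distortion you are willing to tolerate.
  val targetDim = 256

  // Deterministic "index vector" for an original feature id: a Rademacher
  // (+1/-1) vector seeded by the feature id, regenerated on demand so no
  // projection matrix has to be stored or broadcast.
  private def indexVector(featureId: Int): Array[Double] = {
    val rng = new Random(MurmurHash3.stringHash(featureId.toString))
    Array.fill(targetDim)(if (rng.nextBoolean()) 1.0 else -1.0)
  }

  // Project one sparse vector of arbitrary dimension down to targetDim by
  // accumulating value * indexVector(featureId) over the non-zero entries.
  def project(v: SparseVector): Vector = {
    val out = new Array[Double](targetDim)
    var i = 0
    while (i < v.indices.length) {
      val r = indexVector(v.indices(i))
      val x = v.values(i)
      var j = 0
      while (j < targetDim) { out(j) += x * r(j); j += 1 }
      i += 1
    }
    // Usual JL normalization; a constant scale does not change the clustering.
    var j = 0
    while (j < targetDim) { out(j) /= math.sqrt(targetDim.toDouble); j += 1 }
    Vectors.dense(out)
  }

  // Usage sketch: embed the sparse vectors, then cluster the now small and
  // dense vectors with the stock MLlib KMeans.
  def cluster(sc: SparkContext, sparseData: RDD[SparseVector], k: Int) = {
    val embedded = sparseData.map(v => project(v)).cache()
    KMeans.train(embedded, k, 20)
  }
}
{code}
Because each index vector is regenerated from the feature id, nothing of size
proportional to the original dimension ever has to be materialized, which is
the essential trick of random indexing.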
was (Author: derrickburns):
Supporting sparse data in the clusterer is a bad idea.
But so is converting sparse data directly to a dense representation for any
data of significant dimension.
So, what do you do?
The HashingTF transformer is one approach to reducing the dimensionality of the
data, but it is not a good one: hash collisions can lead to dramatic
overestimates of feature values.
Instead, random indexing should be used.
Random Indexing (http://en.wikipedia.org/wiki/Random_indexing), per the
Johnson-Lindenstrauss lemma, guarantees (with high probability) that the
embedding into a lower-dimensional space approximately preserves pairwise
distances.
See https://github.com/derrickburns/generalized-kmeans-clustering for an
implementation.
> KMeans support sparse cluster centers
> -------------------------------------
>
> Key: SPARK-4039
> URL: https://issues.apache.org/jira/browse/SPARK-4039
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 1.1.0
> Reporter: Antoine Amend
> Labels: clustering
>
> When the number of features is not known, it might be quite helpful to create
> sparse vectors using HashingTF.transform. KMeans transforms center vectors
> to dense vectors
> (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L307),
> thereby leading to OutOfMemory errors (even with small k).
> Any way to keep vectors sparse?
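For context, a minimal sketch of the pattern the reporter describes; the
document RDD, k, and the feature count are illustrative. The sparse TF vectors
themselves stay cheap, but the densified cluster centers are what exhaust
memory.
{code:scala}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector

def run(sc: SparkContext, docs: RDD[Seq[String]], k: Int) = {
  // Hash each tokenized document into a very wide sparse TF vector.
  val tf = new HashingTF(1 << 20)                    // 1,048,576 features
  val vectors: RDD[Vector] = tf.transform(docs).cache()

  // KMeans converts its cluster centers to dense vectors (see the line of
  // KMeans.scala linked above), so each center becomes a 2^20-element dense
  // array; this is the densification behind the reported OutOfMemory errors.
  KMeans.train(vectors, k, 20)
}
{code}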
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]