[ https://issues.apache.org/jira/browse/ATLAS-437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Roy Levin resolved ATLAS-437.
-----------------------------
Resolution: Duplicate
> Changes to support KMeans with large feature space
> --------------------------------------------------
>
> Key: ATLAS-437
> URL: https://issues.apache.org/jira/browse/ATLAS-437
> Project: Atlas
> Issue Type: Improvement
> Reporter: Roy Levin
>
> The problem:
> -----------------
> In Spark's KMeans code the center vectors are always represented as dense
> vectors. As a result, when the feature space is large the algorithm quickly
> runs out of memory. In my example I have a feature space of around 50000
> dimensions and k ~= 500. That comes to around 200MB of RAM for the center
> vectors alone (50000 x 500 doubles at 8 bytes each), while in fact the
> center vectors are very sparse and require a lot less RAM.
> Since I am running on a system with relatively low resources, I keep getting
> OutOfMemory errors. In my setting it is OK to trade runtime for lower RAM
> usage. This is what I set out to do in my solution, while giving users the
> flexibility to choose.
> One solution could be to reduce the dimensions of the feature space, but this
> is not always the best approach. For example, suppose the object space
> consists of users and the feature space of items. In that case we may want to
> run KMeans over a feature space whose entries are a function of how many
> times user i clicked item j. If we reduce the dimensions of the items, we
> will not be able to map the center vectors back to the items. Moreover, in a
> streaming context, detecting changes with respect to previous runs gets more
> difficult.
> My solution:
> ----------------
> Allow the KMeans algorithm to accept a VectorFactory which decides when
> vectors used inside the algorithm should be sparse and when they should be
> dense. For backward compatibility the default behavior is to always make them
> dense (as is the case now). The user can, however, provide a
> SmartVectorFactory (or some proprietary VectorFactory) which can decide to
> make vectors sparse.
> For this I made the following changes:
> (1) Added a method called reassign to SparseVector that allows changing its
> indices and values
> (2) Allowed axpy to accept SparseVector arguments
> (3) Created a trait called VectorFactory and two implementations of it that
> are used within the KMeans code
> To get the solution described above, run:
> git clone https://github.com/levin-royl/spark.git -b SupportLargeFeatureDomains
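
The idea in the description above can be sketched roughly as follows. This is
an illustrative Python sketch, not the Scala code from the branch: the names
VectorFactory, SmartVectorFactory and axpy come from the description, but
their real signatures are not shown there, so everything else here (DenseVec,
SparseVec, DenseVectorFactory, the create method, the max_density threshold)
is a hypothetical stand-in. It shows the 200MB estimate, a factory that picks
a representation per vector, and an axpy that can accumulate into a sparse
vector.

```python
from dataclasses import dataclass
from typing import Dict, List, Union

BYTES_PER_DOUBLE = 8


def dense_centers_bytes(num_features: int, k: int) -> int:
    """Memory needed for k dense centers of num_features doubles each."""
    return num_features * k * BYTES_PER_DOUBLE


@dataclass
class DenseVec:  # hypothetical stand-in for a dense vector
    values: List[float]


@dataclass
class SparseVec:  # hypothetical stand-in for a sparse vector
    size: int
    entries: Dict[int, float]  # index -> non-zero value


Vec = Union[DenseVec, SparseVec]


class DenseVectorFactory:
    """Backward-compatible default: every vector is dense."""

    def create(self, size: int, entries: Dict[int, float]) -> Vec:
        values = [0.0] * size
        for i, v in entries.items():
            values[i] = v
        return DenseVec(values)


class SmartVectorFactory:
    """Assumed policy: go sparse when few entries are non-zero."""

    def __init__(self, max_density: float = 0.3) -> None:
        self.max_density = max_density  # hypothetical tuning knob

    def create(self, size: int, entries: Dict[int, float]) -> Vec:
        nonzero = {i: v for i, v in entries.items() if v != 0.0}
        if len(nonzero) <= self.max_density * size:
            return SparseVec(size, nonzero)
        return DenseVectorFactory().create(size, nonzero)


def axpy(a: float, x: SparseVec, y: Vec) -> None:
    """Compute y += a * x in place; y may be dense or sparse."""
    for i, v in x.entries.items():
        if isinstance(y, DenseVec):
            y.values[i] += a * v
        else:
            y.entries[i] = y.entries.get(i, 0.0) + a * v


# 50000 features x 500 centers x 8 bytes = 200,000,000 bytes
print(dense_centers_bytes(50000, 500))
```

With the sparse representation, memory scales with the number of non-zero
entries per center rather than with the full feature space, at the cost of
slower per-element updates, which is the runtime-for-RAM trade-off the
description mentions.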
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)