Shaofeng SHI closed KYLIN-1326.

> Changes to support KMeans with large feature space
> --------------------------------------------------
>                 Key: KYLIN-1326
>                 URL: https://issues.apache.org/jira/browse/KYLIN-1326
>             Project: Kylin
>          Issue Type: Improvement
>          Components: Spark
>            Reporter: Roy Levin
> The problem:
> -----------------
> In Spark's KMeans code the center vectors are always represented as dense 
> vectors. As a result, when the feature space is large the algorithm quickly 
> runs out of memory. In my example I have a feature space of around 50000 
> dimensions and k ~= 500. At 8 bytes per double this comes to around 200MB 
> of RAM for the center vectors alone, while in fact the center vectors are 
> very sparse and need far less RAM.
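
The 200MB figure can be checked with a few lines of arithmetic. The sketch below is in Python for brevity and is not taken from the patch. The sparse estimate assumes an index/value layout with a 4-byte index and an 8-byte value per non-zero entry, and the figure of 1000 non-zero entries per center is purely hypothetical, chosen only to show the scale of the difference.

```python
DOUBLE_BYTES = 8  # one 64-bit floating point value
INT_BYTES = 4     # one 32-bit index (assumed sparse layout)


def dense_bytes(k, dim):
    """Memory for k dense centers: every dimension stores a double."""
    return k * dim * DOUBLE_BYTES


def sparse_bytes(k, nnz_per_center):
    """Memory for k sparse centers: one index and one value per non-zero."""
    return k * nnz_per_center * (INT_BYTES + DOUBLE_BYTES)


# Numbers from the report: about 50000 features and k ~= 500.
dense = dense_bytes(500, 50_000)
# Hypothetical sparsity: 1000 non-zero entries per center.
sparse = sparse_bytes(500, 1_000)

print(f"dense:  {dense / 1e6:.0f} MB")
print(f"sparse: {sparse / 1e6:.0f} MB")
```

Object headers and array overhead are ignored here, so real usage would be somewhat higher in both cases.
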
> Since I am running on a system with relatively low resources I keep getting 
> OutOfMemory errors. In my setting it is OK to trade off runtime for using 
> less RAM. This is what I set out to do in my solution while allowing users 
> the flexibility to choose.
> One solution could be to reduce the dimensions of the feature space, but 
> this is not always the best approach. For example, suppose the object space 
> consists of users and the feature space of items, and we want to run kmeans 
> over features that are a function of how many times user i clicked item j. 
> If we reduce the dimensions of the items we will not be able to map the 
> center vectors back to the items. Moreover, in a streaming context, 
> detecting changes with respect to previous runs gets more difficult.
> My solution:
> ----------------
> Allow the kmeans algorithm to accept a VectorFactory which decides when 
> vectors used inside the algorithm should be sparse and when they should be 
> dense. For backward compatibility the default behavior is to always make 
> them dense (as is the case now). The user can, however, provide a 
> SmartVectorFactory (or some proprietary VectorFactory) which can decide to 
> make vectors sparse.
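
To make the factory idea concrete, here is a minimal sketch in Python. It is not the code from the patch, which is Scala inside Spark. The class names follow the description above, but the `create` method, the `max_density` threshold, and the use of a plain list for dense and a dict for sparse vectors are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod


class VectorFactory(ABC):
    """Hypothetical interface: picks the representation of a vector."""

    @abstractmethod
    def create(self, size, entries):
        """Build a vector of length `size` from {index: value} entries."""


class DenseVectorFactory(VectorFactory):
    """Default behavior: always produce a dense vector (a list)."""

    def create(self, size, entries):
        vec = [0.0] * size
        for i, v in entries.items():
            vec[i] = v
        return vec


class SmartVectorFactory(VectorFactory):
    """Produce a sparse vector (a dict) when few entries are non-zero."""

    def __init__(self, max_density=0.1):
        # Assumed rule: go sparse when nnz / size <= max_density.
        self.max_density = max_density

    def create(self, size, entries):
        nonzero = {i: v for i, v in entries.items() if v != 0.0}
        if size > 0 and len(nonzero) / size <= self.max_density:
            return nonzero
        return DenseVectorFactory().create(size, nonzero)
```

With this shape, a caller that never passes a factory would keep getting dense vectors, which matches the backward-compatible default described above.
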
> For this I made the following changes:
> (1) Added a method called reassign to SparseVector that allows its indices 
> and values to be changed
> (2) Allowed axpy to accept SparseVectors
> (3) Created a trait called VectorFactory and two implementations of it that 
> are used within the KMeans code
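
For readers unfamiliar with axpy, it is the BLAS operation y += a * x. A rough Python sketch of what a sparse-into-sparse axpy could look like follows. The function name `axpy_sparse` and the dict representation are assumptions for illustration, and the patch itself may handle the sparse case differently.

```python
def axpy_sparse(a, x, y):
    """In-place y += a * x for sparse vectors stored as {index: value}."""
    for i, xv in x.items():
        new_value = y.get(i, 0.0) + a * xv
        if new_value == 0.0:
            # Drop entries that cancel out so y stays sparse.
            y.pop(i, None)
        else:
            y[i] = new_value
    return y


y = {0: 1.0, 3: 2.0}
axpy_sparse(2.0, {3: -1.0, 5: 0.5}, y)
print(y)
```

Only the non-zero entries of x are visited, so the cost depends on the number of non-zeros rather than on the size of the feature space.
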
> To get the solution described above, run the following:
> git clone https://github.com/levin-royl/spark.git -b SupportLargeFeatureDomains

This message was sent by Atlassian JIRA