Posted in the wrong project?

On Sun, Jan 17, 2016 at 2:58 PM, Roy Levin (JIRA) <[email protected]> wrote:

> Roy Levin created KYLIN-1326:
> --------------------------------
>
>              Summary: Changes to support KMeans with large feature space
>                  Key: KYLIN-1326
>                  URL: https://issues.apache.org/jira/browse/KYLIN-1326
>              Project: Kylin
>           Issue Type: Improvement
>           Components: Spark
>             Reporter: Roy Levin
>
>
> The problem:
> -----------------
> In Spark's KMeans code the center vectors are always represented as dense
> vectors. As a result, when each such center has a large domain space the
> algorithm quickly runs out of memory. In my example I have a feature space
> of around 50000 and k ~= 500. This sums up to around 200MB RAM for the
> center vectors alone while in fact the center vectors are very sparse and
> require a lot less RAM.
> Since I am running on a system with relatively low resources I keep
> getting OutOfMemory errors. In my setting it is OK to trade off runtime for
> using less RAM. This is what I set out to do in my solution while allowing
> users the flexibility to choose.
>
> One solution could be to reduce the dimensions of the feature space but
> this is not always the best approach. For example, when the object space is
> comprised of users and the feature space of items. In such an example we
> may want to run kmeans over a feature space which is a function of how many
> times user i clicked item j. If we reduce the dimensions of the items we
> will not be able to map the center vectors back to the items. Moreover in
> a streaming context detecting the changes WRT previous runs gets more
> difficult.
>
>
> My solution:
> ----------------
> Allow the kmeans algorithm to accept a VectorFactory which decides when
> vectors used inside the algorithm should be sparse and when they should be
> dense. For backward compatibility the default behavior is to always make
> them dense (like the situation is now). But now potentially the user can
> provide a SmartVectorFactory (or some proprietary VectorFactory) which can
> decide to make vectors sparse.
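
For reference, a rough Python sketch of the idea described above. It is not the Scala code from the branch: only the names VectorFactory and SmartVectorFactory come from the issue text, while DenseVectorFactory, the create() signature, the max_density threshold, and the 4-byte index size are assumptions made for illustration. It also checks the 200MB figure: 50000 dimensions x 500 centers x 8 bytes per double is 200,000,000 bytes.

```python
# Illustrative sketch only, NOT the code from the SupportLargeFeatureDomains
# branch. Everything except the names VectorFactory and SmartVectorFactory
# is a hypothetical choice made for this example.
from abc import ABC, abstractmethod

BYTES_PER_DOUBLE = 8
BYTES_PER_INDEX = 4  # assumption: 32-bit indices for sparse entries


def dense_bytes(dim, k):
    """Memory for the values of k dense centers of dimension dim."""
    return dim * k * BYTES_PER_DOUBLE


def sparse_bytes(nnz_per_center, k):
    """Memory for k sparse centers, each storing nnz_per_center (index, value) pairs."""
    return nnz_per_center * k * (BYTES_PER_DOUBLE + BYTES_PER_INDEX)


class VectorFactory(ABC):
    @abstractmethod
    def create(self, dim, entries):
        """Build a vector of size dim from a {index: value} mapping."""


class DenseVectorFactory(VectorFactory):
    """Default behavior: always materialize every coordinate."""

    def create(self, dim, entries):
        vec = [0.0] * dim
        for i, v in entries.items():
            vec[i] = v
        return vec


class SmartVectorFactory(VectorFactory):
    """Keep a vector sparse when few of its coordinates are non-zero."""

    def __init__(self, max_density=0.1):  # threshold is an arbitrary assumption
        self.max_density = max_density
        self._dense = DenseVectorFactory()

    def create(self, dim, entries):
        nonzero = {i: v for i, v in entries.items() if v != 0.0}
        if dim > 0 and len(nonzero) / dim <= self.max_density:
            return nonzero  # sparse representation: only non-zero entries
        return self._dense.create(dim, entries)


if __name__ == "__main__":
    print(dense_bytes(50000, 500))   # dense centers from the issue's example
    print(sparse_bytes(100, 500))    # same k, assuming 100 non-zeros per center
```

With 12 bytes per stored entry against 8 bytes per dense coordinate, a sparse layout like this one only saves memory while fewer than about two thirds of the coordinates are non-zero, so any real threshold would need tuning.
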
>
> For this I made the following changes:
> (1) Added a method called reassign to SparseVectors allowing to change the
> indices and values
> (2) Allow axpy to accept SparseVectors
> (3) create a trait called VectorFactory and two implementations for it
> that are used within KMeans code
>
>
> To get the above described solution do the following:
>
> git clone https://github.com/levin-royl/spark.git -b SupportLargeFeatureDomains
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
>

--
Regards,

*Bin Mahone | 马洪宾*
Apache Kylin: http://kylin.io
Github: https://github.com/binmahone
