GitHub user levin-royl opened a pull request:

    https://github.com/apache/spark/pull/10739

    Changes to support KMeans with large feature space

    The problem:
    ------------------
    In Spark's KMeans code the center vectors are always represented as dense 
vectors. As a result, when the feature space is large the algorithm quickly 
runs out of memory. In my example I have a feature space of around 50000 
dimensions and k ~= 500. At 8 bytes per double this comes to around 200MB of 
RAM for the center vectors alone, while in fact the center vectors are very 
sparse and would require far less RAM in a sparse representation.
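
    As a rough check of the figure above, the following sketch computes the 
payload size of the centers in both representations. The object and method 
names are made up for illustration, and the 1000 non-zeros per center is an 
assumed value, not a measurement from my data. Object headers and any 
per-run or per-partition copies are ignored.

```scala
// Back-of-the-envelope estimate of the memory used by the cluster centers.
// All names here are illustrative; none of this is part of Spark.
object CenterMemoryEstimate {
  val BytesPerDouble = 8L
  val BytesPerInt = 4L

  /** Payload bytes for k dense centers: one Double per feature. */
  def denseBytes(k: Long, numFeatures: Long): Long =
    k * numFeatures * BytesPerDouble

  /** Payload bytes for k sparse centers: one Int index plus one Double value per non-zero. */
  def sparseBytes(k: Long, nnzPerCenter: Long): Long =
    k * nnzPerCenter * (BytesPerInt + BytesPerDouble)

  def main(args: Array[String]): Unit = {
    val k = 500L
    val numFeatures = 50000L
    val assumedNnz = 1000L // hypothetical sparsity, chosen only for illustration

    println(s"dense:  ${denseBytes(k, numFeatures)} bytes")  // 200000000 bytes, i.e. ~200MB
    println(s"sparse: ${sparseBytes(k, assumedNnz)} bytes")  // 6000000 bytes, i.e. ~6MB
  }
}
```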
    Since I am running on a system with relatively low resources I keep getting 
OutOfMemory errors. In my setting it is OK to trade off runtime for using less 
RAM. This is what I set out to do in my solution while allowing users the 
flexibility to choose.
    
    My solution:
    ----------------
    Allow the KMeans algorithm to accept a VectorFactory, which decides when 
vectors used inside the algorithm should be sparse and when they should be 
dense. For backward compatibility the default behavior is to always make them 
dense (as is the case now). The user can, however, provide a 
SmartVectorFactory (or their own VectorFactory implementation) which can 
decide to make vectors sparse.
    
    For this I made the following changes:
    (1) Added a method called reassign to SparseVector that allows changing 
its indices and values
    (2) Allowed axpy to accept SparseVectors
    (3) Created a trait called VectorFactory and two implementations of it 
that are used within the KMeans code
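
    To give an idea of the factory approach in (3), here is a minimal 
sketch. It is not the code in this pull request: the method name compact, the 
Vec/DenseVec/SparseVec stand-in types (used so the sketch runs without Spark 
on the classpath) and the size-based rule for choosing a representation are 
all assumptions made for illustration.

```scala
// Illustrative sketch only; signatures and the selection rule are assumptions,
// not the actual VectorFactory API from the pull request.
object VectorFactorySketch {
  // Minimal stand-ins for Spark's DenseVector and SparseVector.
  sealed trait Vec { def size: Int }
  final case class DenseVec(values: Array[Double]) extends Vec {
    def size: Int = values.length
  }
  final case class SparseVec(size: Int, indices: Array[Int], values: Array[Double]) extends Vec

  /** Decides which representation a vector built from raw values should use. */
  trait VectorFactory {
    def compact(values: Array[Double]): Vec
  }

  /** Backward-compatible default: always dense. */
  object DenseVectorFactory extends VectorFactory {
    def compact(values: Array[Double]): Vec = DenseVec(values)
  }

  /**
   * Chooses sparse only when it is estimated to be smaller:
   * 12 bytes per non-zero (Int index + Double value) versus 8 bytes per entry.
   */
  object SmartVectorFactory extends VectorFactory {
    def compact(values: Array[Double]): Vec = {
      val nnz = values.count(_ != 0.0)
      if (nnz * 12L < values.length * 8L) {
        val indices = values.indices.filter(i => values(i) != 0.0).toArray
        SparseVec(values.length, indices, indices.map(i => values(i)))
      } else {
        DenseVec(values)
      }
    }
  }
}
```

    With this rule, SmartVectorFactory.compact(Array(0.0, 0.0, 3.0, 0.0)) 
yields a sparse vector with a single entry at index 2, while an array that is 
mostly non-zero stays dense. The trade-off is the extra pass over the values 
each time a vector is created, which costs runtime in exchange for less RAM.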

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/levin-royl/spark SupportLargeFeatureDomains

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10739.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #10739
    
----
commit 33d760c7d848da66d8a84523f11a7fc38ff1afc4
Author: Roy Levin <[email protected]>
Date:   2016-01-13T10:47:11Z

    Changes to support KMeans with large feature space

----


