GitHub user levin-royl opened a pull request:
https://github.com/apache/spark/pull/10739
Changes to support KMeans with large feature space
The problem:
------------------
In Spark's KMeans code the center vectors are always represented as dense
vectors. As a result, when the feature space is large the algorithm quickly
runs out of memory. In my example I have a feature space of around 50000
dimensions and k ~= 500. At 8 bytes per double this comes to around 200MB of
RAM for the center vectors alone, while in fact the center vectors are very
sparse and would require a lot less RAM.
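The arithmetic behind that figure can be checked with a small sketch. The dense numbers come from the description above; the 1000 non-zeros per center used for the sparse case is a hypothetical value chosen only for illustration, and JVM object overhead and any extra copies KMeans keeps are ignored.

```scala
// Back-of-the-envelope estimate of dense vs. sparse storage for the centers.
object CenterMemoryEstimate {
  val numFeatures = 50000L  // feature-space size from the description
  val k = 500L              // number of cluster centers from the description
  val bytesPerDouble = 8L
  val bytesPerInt = 4L

  // Dense: every center stores one Double per feature.
  val denseBytes: Long = k * numFeatures * bytesPerDouble

  // Sparse: each non-zero entry costs one Int index plus one Double value.
  def sparseBytes(nonZerosPerCenter: Long): Long =
    k * nonZerosPerCenter * (bytesPerInt + bytesPerDouble)

  def main(args: Array[String]): Unit = {
    println(s"dense:  ${denseBytes / 1000000} MB")
    // 1000 non-zeros per center is an assumed figure, not a measurement.
    println(s"sparse: ${sparseBytes(1000L) / 1000000} MB")
  }
}
```

With these inputs the dense estimate is 200 MB and the sparse estimate is 6 MB.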
Since I am running on a system with relatively low resources I keep getting
OutOfMemory errors. In my setting it is acceptable to trade runtime for lower
memory use. My solution makes that trade-off possible while leaving the choice
to the user.
My solution:
----------------
Allow the KMeans algorithm to accept a VectorFactory, which decides when
vectors used inside the algorithm should be sparse and when they should be
dense. For backward compatibility the default behavior is to always make them
dense (as is the case now). The user can now provide a SmartVectorFactory (or
a custom VectorFactory) that can decide to make vectors sparse.
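A minimal sketch of what such a factory could look like, not the code in this pull request. Only the names VectorFactory and SmartVectorFactory come from the description; the create method, the DenseVectorFactory name, and the 0.3 density threshold are assumptions made for illustration. It uses the public Vectors.dense and Vectors.sparse constructors and needs spark-mllib on the classpath.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Hypothetical interface: only the trait name comes from the description.
trait VectorFactory extends Serializable {
  def create(values: Array[Double]): Vector
}

// Mirrors the backward-compatible default: always build a dense vector.
object DenseVectorFactory extends VectorFactory {
  override def create(values: Array[Double]): Vector = Vectors.dense(values)
}

// Hypothetical "smart" policy: go sparse when the fraction of non-zeros is
// at most maxDensity. The 0.3 default is an illustrative assumption.
class SmartVectorFactory(maxDensity: Double = 0.3) extends VectorFactory {
  override def create(values: Array[Double]): Vector = {
    // Positions of the non-zero entries, in increasing order.
    val nonZeroIdx = values.indices.filter(i => values(i) != 0.0).toArray
    val density =
      if (values.isEmpty) 1.0 else nonZeroIdx.length.toDouble / values.length
    if (density <= maxDensity) {
      Vectors.sparse(values.length, nonZeroIdx, nonZeroIdx.map(i => values(i)))
    } else {
      Vectors.dense(values)
    }
  }
}
```

Under these assumptions, `new SmartVectorFactory().create(Array(0.0, 0.0, 0.0, 5.0, 0.0))` yields a sparse vector, because one non-zero in five entries is below the threshold, while DenseVectorFactory keeps the current behavior.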
For this I made the following changes:
(1) Added a method called reassign to SparseVector that allows changing its
indices and values
(2) Allowed axpy to accept SparseVectors
(3) Created a trait called VectorFactory and two implementations of it that
are used within the KMeans code
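One plausible reading of how changes (1) and (2) fit together, offered as an assumption rather than a description of the patch: axpy computes y := a * x + y in place, and when y is sparse the result can have non-zeros at positions y did not have before, so y's index and value arrays must be replaceable. The sketch below shows only the merge step on plain arrays; SparseAxpy and its signature are hypothetical and are not Spark's BLAS API.

```scala
// Computes a * x + y for two sparse vectors, each given as parallel arrays of
// strictly increasing indices and their values. Returns the merged arrays.
object SparseAxpy {
  def axpy(a: Double,
           xIdx: Array[Int], xVal: Array[Double],
           yIdx: Array[Int], yVal: Array[Double]): (Array[Int], Array[Double]) = {
    val idx = Array.newBuilder[Int]
    val vals = Array.newBuilder[Double]
    var i = 0
    var j = 0
    while (i < xIdx.length || j < yIdx.length) {
      if (j >= yIdx.length || (i < xIdx.length && xIdx(i) < yIdx(j))) {
        // Index present only in x: contributes a * x(i).
        idx += xIdx(i); vals += a * xVal(i); i += 1
      } else if (i >= xIdx.length || yIdx(j) < xIdx(i)) {
        // Index present only in y: value carried over unchanged.
        idx += yIdx(j); vals += yVal(j); j += 1
      } else {
        // Index present in both: a * x(i) + y(j).
        idx += xIdx(i); vals += a * xVal(i) + yVal(j); i += 1; j += 1
      }
    }
    (idx.result(), vals.result())
  }
}
```

For example, with a = 2.0, x = {0: 1.0, 3: 2.0} and y = {3: 10.0, 5: 20.0}, the merged result is {0: 2.0, 3: 14.0, 5: 20.0}: y gains index 0, which is the case where a reassign-style update would be needed. This sketch does not drop entries that cancel to zero.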
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/levin-royl/spark SupportLargeFeatureDomains
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/10739.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #10739
----
commit 33d760c7d848da66d8a84523f11a7fc38ff1afc4
Author: Roy Levin <[email protected]>
Date: 2016-01-13T10:47:11Z
Changes to support KMeans with large feature space
----