Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/2634#issuecomment-70589276
In my application (n-gram contexts), the sparse vectors can be of extremely
high dimension. To make the problem manageable, I select the k most important
dimensions per point. For a cluster of m points, the aggregated vector can then
have up to m*k non-zero values. Since some clusters can become quite large
(O(n) in size), I can get sparse vectors of O(nk) non-zero values. Still, the
vector is sparse, since its dimension is potentially 2^31 - 1.
So, I cannot treat the vector as dense.
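To make the growth concrete, here is a minimal sketch (plain Scala maps, with a
hypothetical `addSparse` helper that is not from this PR) of why the non-zero
count blows up when per-point top-k vectors are summed into a centroid:

```scala
// Sketch only: a sparse vector represented as an index -> value map.
// Each point contributes at most k non-zero entries, but the running sum
// keeps the union of all indices seen, so a cluster of m points can
// approach m * k non-zero entries in its centroid.
def addSparse(centroid: Map[Int, Double], point: Map[Int, Double]): Map[Int, Double] =
  point.foldLeft(centroid) { case (acc, (i, v)) =>
    acc.updated(i, acc.getOrElse(i, 0.0) + v)
  }
```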
To deal with the growth problem, I have implemented a centroid that only
retains the k most important features.
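Roughly, the truncation looks like this (a sketch under my own assumptions, not
the exact implementation; absolute value is used here as the importance
measure):

```scala
// Sketch only: after each merge, keep the k entries with the largest
// absolute weight so the centroid never holds more than k non-zeros.
def truncateTopK(centroid: Map[Int, Double], k: Int): Map[Int, Double] =
  centroid.toSeq.sortBy { case (_, v) => -math.abs(v) }.take(k).toMap
```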
> On Jan 19, 2015, at 5:07 PM, Xiangrui Meng <[email protected]>
wrote:
>
> @derrickburns We use dense vectors to store cluster centers, because the
centers are very likely to become dense during aggregation. If there are zeros,
they can be efficiently compressed before sending back to the driver. For
performance, we never add two sparse vectors together. I'm not sure whether
this answers your question.
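For illustration only (this is not Spark's actual k-means code), the pattern
described in the quoted reply could look roughly like this: accumulate each
center densely, then re-encode it sparsely before sending it back to the driver
when that is smaller. The bytes-per-entry heuristic below is my assumption, not
Spark's.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Sketch only: compress a densely accumulated center before shipping it
// back to the driver, if the sparse encoding would be smaller.
def compress(denseCenter: Array[Double]): Vector = {
  val nnz = denseCenter.count(_ != 0.0)
  if (nnz * 12 < denseCenter.length * 8) { // assumed rough size heuristic
    val (indices, values) =
      denseCenter.zipWithIndex.collect { case (v, i) if v != 0.0 => (i, v) }.unzip
    Vectors.sparse(denseCenter.length, indices, values)
  } else {
    Vectors.dense(denseCenter)
  }
}
```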