Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/2634#issuecomment-70589276
In my application (n-gram contexts), the sparse vectors can be of extremely
high dimension. To make the problem manageable, I select the k most important
dimensions per point. For a cluster of m points, the aggregated vector can then
have up to m*k non-zero values. Since some clusters can become quite large
(O(n) in size), I can get sparse vectors of O(nk) non-zero values. Still, the
vector is sparse, since its dimension is potentially 2^31 - 1.
So, I cannot treat the vector as dense.
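To make the growth concrete, here is a minimal sketch (plain Scala maps, with a
hypothetical `addSparse` helper that is not from this PR) of why the non-zero
count blows up when per-point top-k vectors are summed into a centroid:

```scala
// Sketch only: a sparse vector represented as an index -> value map.
// Each point contributes at most k non-zero entries, but the running sum
// keeps the union of all indices seen, so a cluster of m points can
// approach m * k non-zero entries in its centroid.
def addSparse(centroid: Map[Int, Double], point: Map[Int, Double]): Map[Int, Double] =
  point.foldLeft(centroid) { case (acc, (i, v)) =>
    acc.updated(i, acc.getOrElse(i, 0.0) + v)
  }
```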
To deal with the growth problem, I have implemented a centroid that only
retains the k most important features.
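Roughly, the truncation looks like this (a sketch under my own assumptions, not
the exact implementation; absolute value is used here as the importance
measure):

```scala
// Sketch only: after each merge, keep the k entries with the largest
// absolute weight so the centroid never holds more than k non-zeros.
def truncateTopK(centroid: Map[Int, Double], k: Int): Map[Int, Double] =
  centroid.toSeq.sortBy { case (_, v) => -math.abs(v) }.take(k).toMap
```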
> On Jan 19, 2015, at 5:07 PM, Xiangrui Meng <[email protected]>
wrote:
>
> @derrickburns We use dense vectors to store cluster centers, because the
centers are very likely to become dense during aggregation. If there are zeros,
they can be efficiently compressed before sending back to the driver. For
performance, we never add two sparse vectors together. I'm not sure whether
this answers your question.
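For illustration only (this is not Spark's actual k-means code), the pattern
described in the quoted reply could look roughly like this: accumulate each
center densely, then re-encode it sparsely before sending it back to the driver
when that is smaller. The bytes-per-entry heuristic below is my assumption, not
Spark's.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Sketch only: compress a densely accumulated center before shipping it
// back to the driver, if the sparse encoding would be smaller.
def compress(denseCenter: Array[Double]): Vector = {
  val nnz = denseCenter.count(_ != 0.0)
  if (nnz * 12 < denseCenter.length * 8) { // assumed rough size heuristic
    val (indices, values) =
      denseCenter.zipWithIndex.collect { case (v, i) if v != 0.0 => (i, v) }.unzip
    Vectors.sparse(denseCenter.length, indices, values)
  } else {
    Vectors.dense(denseCenter)
  }
}
```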