Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/2634#issuecomment-69404412
Thanks for the information on the speedup that you obtained by eliminating
Breeze. I was unaware that its performance was so poor. To what do you
attribute the poor performance of Breeze?
One can certainly split out 1-4 AFTER doing 5. My point was that doing 5
(supporting Bregman divergences) requires touching practically every method
in the clustering package.
3. Supporting a variable number of clusters requires a significant rewrite
as well, since the assumption that there are K clusters is made repeatedly
in the code. It seems that adding a way to split clusters introduces a
whole other dimension of flexibility -- which I am not against -- that
would complicate the interface. For example, what heuristic would you use
to identify a cluster to split, and how would you split it? It seems to me
that once you go down that path, you are headed toward a divisive
clustering algorithm.
4. As for the overhead of running k-means++ distributively: would anyone
use Spark for small k and feature dimension?
5. Bregman divergences
As it turns out, I now have a need to perform clustering on sparse data, so
I am making a pass over my private implementation to 1) eliminate the use
of Breeze and 2) use the Spark Vector classes. When I am done, I can
include these modifications in an updated pull request.
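For reference, building points with the Spark Vector classes instead of
Breeze looks roughly like this (a minimal sketch using the public
org.apache.spark.mllib.linalg factory methods; the sizes and values are
made up for illustration):

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// A sparse point: logical size 6, nonzeros at indices 0, 3, and 5.
val sparsePoint: Vector = Vectors.sparse(6, Array(0, 3, 5), Array(2.0, 7.0, 1.0))

// A dense point, e.g. a cluster center, with the same logical size.
val densePoint: Vector = Vectors.dense(2.0, 0.0, 0.0, 7.0, 0.0, 1.0)

// Both share the Vector interface, so distance code can pattern-match
// on SparseVector / DenseVector to choose a fast path for each.
```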
I do not understand your conclusion that for KL-divergence and generalized
I-divergence the points cannot be sparse. The domain need not be R^d_+;
rather, it can be R^infinity_+.
For example, in the problem that I am currently working on, I am creating
sparse vectors that represent the frequency of occurrence of the contexts
for given n-grams. The space of contexts is extremely large; however, I am
only interested in the M most frequent contexts per n-gram, which I
identify using a sketch. This gives me, for each n-gram, an M-length
SparseVector of non-negative frequency values that represents the
distribution of its more frequently occurring contexts. I do not see a
problem using this SparseVector with the KL-divergence metric. Am I missing
something?
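To make this concrete, here is a minimal sketch of the generalized
I-divergence computed through the decomposition you propose below,
d(x, y) = f(x) - (f(y) - <y, g(y)>) - <x, g(y)>, with
f(x) = sum_i x_i log x_i and g(y)_i = 1 + log y_i (names like IDivergence
and CenterStats are mine, purely illustrative). Only the nonzero entries of
the sparse point are touched; it is the centers, not the points, that must
be strictly positive, which smoothing guarantees:

```scala
import org.apache.spark.mllib.linalg.{DenseVector, SparseVector}

object IDivergence {
  // Per-center terms, cached once per center per Lloyd's iteration.
  // With f(y) = sum_i y_i log y_i and g(y)_i = 1 + log y_i, the
  // constant f(y) - <y, g(y)> simplifies to -sum_i y_i.
  case class CenterStats(logY: Array[Double], constant: Double)

  def cache(center: DenseVector): CenterStats = {
    require(center.values.forall(_ > 0.0), "centers must be strictly positive")
    CenterStats(center.values.map(math.log), -center.values.sum)
  }

  // d(x, y) = sum_i x_i log(x_i / y_i) - sum_i x_i + sum_i y_i.
  // Entries with x_i = 0 contribute nothing to f(x) or <x, g(y)>
  // (using the convention 0 log 0 = 0), so only the nonzeros of the
  // sparse point are visited.
  def divergence(x: SparseVector, c: CenterStats): Double = {
    var fx = 0.0    // f(x), accumulated over the nonzeros of x
    var xDotG = 0.0 // <x, g(y)>, accumulated over the nonzeros of x
    var i = 0
    while (i < x.indices.length) {
      val xi = x.values(i)
      fx += xi * math.log(xi)
      xDotG += xi * (1.0 + c.logY(x.indices(i)))
      i += 1
    }
    fx - c.constant - xDotG
  }
}
```

In my n-gram problem, x would be the M-length SparseVector of context
frequencies and the centers would be smoothed to stay strictly positive, so
the points themselves never need to be densified.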
On Thu, Jan 8, 2015 at 3:55 PM, Xiangrui Meng <[email protected]>
wrote:
> @derrickburns <https://github.com/derrickburns> I like the improvements
> implemented in this PR. But as @srowen <https://github.com/srowen>
> mentioned, we have to resolve conflicts with the master branch before we
> can merge any PR. I compared the performance of this PR with master on
> mnist-digits (60000x784, sparse, 10 clusters) locally and found that
> master runs 2-3x faster. I guess this is mainly caused by two changes.
>
> 1. We replaced breeze operations by our own implementation. The latter
> is about 2-3x faster.
> 2. Running k-means++ distributively has noticeable overhead with small
> k and feature dimension.
>
> I think it is still feasible to include features through separate PRs:
>
> 1. remember previously computed best distances in k-means++
> initialization
> 2. allow fixing the random seed (addressed in #3610
> <https://github.com/apache/spark/pull/3610>)
> 3. variable number of clusters. We should discuss whether we want to
> have fewer than k clusters or split the biggest one if there are more
> than k points.
> 4. parallelize k-means++. I think whether we should replace local
> k-means++ or make it configurable requires some discussion and
> performance comparison.
> 5. support Bregman divergences
>
> Putting all of them together would certainly delay the review process and
> require resolving conflicts. I may have some time to prepare PRs for some
> of the features here, if you don't mind.
>
> For Bregman divergences, I'm thinking we can alter the formulation to
> support sparse vectors:
>
> d(x, y) = f(x) - f(y) - <x - y, g(y)> = f(x) - (f(y) - <y, g(y)>) - <x, g(y)>
>
> where f(x), g(y), and f(y) - <y, g(y)> could be pre-computed and cached,
> and <x, g(y)> can take advantage of sparse x. But I'm not sure whether
> this formulation is really useful for any Bregman divergence other than
> the squared distance and the Mahalanobis distance. For KL-divergence and
> generalized I-divergence, the domain is R^d_+ and hence the points cannot
> be sparse.
>
> Besides those comments, I'm going to make some minor comments inline.
>
> Reply to this email directly or view it on GitHub
> <https://github.com/apache/spark/pull/2634#issuecomment-69272196>.
>