[ 
https://issues.apache.org/jira/browse/SPARK-6068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341417#comment-14341417
 ] 

Derrick Burns commented on SPARK-6068:
--------------------------------------

Thanks, Sean, for your thoughtful comments!

The main requirement that drove my initial effort was to generalize the 
distance function to the provably largest class of distance functions for 
which the core algorithm works. This is the class of Bregman divergences. 
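
To make the idea concrete, here is a minimal sketch of how a Bregman 
divergence can be built from a convex function F and its gradient, using 
D_F(x, y) = F(x) - F(y) - <grad F(y), x - y>. This is illustrative Python, not 
code from either implementation; the names bregman, squared_euclidean, 
generalized_kl and closest are hypothetical.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def bregman(f: Callable[[Vector], float],
            grad_f: Callable[[Vector], List[float]]) -> Callable[[Vector, Vector], float]:
    """Build D_F(x, y) = F(x) - F(y) - <grad F(y), x - y> from a convex F."""
    def divergence(x: Vector, y: Vector) -> float:
        g = grad_f(y)
        inner = sum(gi * (xi - yi) for gi, xi, yi in zip(g, x, y))
        return f(x) - f(y) - inner
    return divergence


# F(x) = ||x||^2 yields the squared Euclidean distance.
squared_euclidean = bregman(
    lambda x: sum(xi * xi for xi in x),
    lambda x: [2.0 * xi for xi in x],
)

# F(x) = sum x_i log x_i yields the generalized KL divergence
# (defined here only for strictly positive coordinates).
generalized_kl = bregman(
    lambda x: sum(xi * math.log(xi) for xi in x),
    lambda x: [math.log(xi) + 1.0 for xi in x],
)


def closest(point: Vector, centers: Sequence[Vector],
            divergence: Callable[[Vector, Vector], float]) -> int:
    """Assignment step: index of the center with the smallest divergence."""
    return min(range(len(centers)), key=lambda i: divergence(point, centers[i]))
```

The assignment step only needs the divergence as a parameter. The update step 
can stay the same because, for a Bregman divergence with the center as the 
second argument, the arithmetic mean of a cluster minimizes the total 
divergence to its points.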

Unfortunately, the current Spark implementation relies on knowledge of the 
specific distance function in many places. Removing those dependencies would 
result in code that is more general, just as efficient, much easier to read, 
and easier to prove correct. Alas, it would also touch many lines of code.

The other changes that I have made are largely independent and can easily be 
layered on top of the base change. One could make those changes to either code 
base (indeed, one such change was recently implemented). 

However, I do not want to invest in supporting a code base that lacks the 
feature that drives my work. Despite that, I report the issues I find and fix 
that are shared by both implementations, so that others are at least aware of 
them. 

I think that my alternative implementation demonstrates that one can introduce 
my desired features with minimal impact on the user-visible API, so this is not 
an API or backward-compatibility issue in the way the new pipelines 
architecture is. 

I'm happy to maintain a separate implementation and make it publicly available, 
particularly since my application requires a different distance function. Next 
week, I plan to release a version, if I can figure out how to do that easily. :)





> KMeans Parallel test may fail
> -----------------------------
>
>                 Key: SPARK-6068
>                 URL: https://issues.apache.org/jira/browse/SPARK-6068
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.2.1
>            Reporter: Derrick Burns
>              Labels: clustering
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The test "k-means|| initialization" in KMeansSuite can fail when the random 
> number generator is truly random.
> The test is predicated on the assumption that each round of k-means|| will 
> add at least one new cluster center.  The current implementation of 
> k-means|| adds 2*k cluster centers with high probability.  However, there is 
> no deterministic lower bound on the number of cluster centers added.
> Choices are:
> 1) Change the k-means|| implementation to iterate on selecting points until 
> it has satisfied a lower bound on the number of points chosen.
> 2) Eliminate the test.
> 3) Ignore the problem and depend on the random number generator to sample the 
> space in a lucky manner. 
> Option (1) is most in keeping with the contract that k-means|| should provide 
> a precise number of cluster centers when possible. 
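
A rough sketch of what option (1) could look like for a single sampling round 
is given below. It is illustrative Python rather than the Spark code, and it 
assumes that each point is chosen independently with probability 
2*k*cost/total_cost. The function name sample_round and the parameters min_new 
and max_attempts are hypothetical.

```python
import random
from typing import List, Optional, Sequence


def sample_round(costs: Sequence[float], k: int, min_new: int = 1,
                 rng: Optional[random.Random] = None,
                 max_attempts: int = 100) -> List[int]:
    """Pick indices of new candidate centers for one k-means|| round.

    Resamples until at least `min_new` points are chosen, so that the number
    of centers added per round has a deterministic lower bound.
    """
    rng = rng or random.Random()
    total = sum(costs)
    if total <= 0:
        return []  # every point already coincides with a center

    # Only points with positive cost can ever be selected.
    candidates = [i for i, c in enumerate(costs) if c > 0]
    needed = min(min_new, len(candidates))

    for _ in range(max_attempts):
        chosen = [i for i in candidates
                  if rng.random() < min(1.0, 2.0 * k * costs[i] / total)]
        if len(chosen) >= needed:
            return chosen

    # Assumed fallback: take the highest-cost points so the bound always holds.
    return sorted(candidates, key=lambda i: costs[i], reverse=True)[:needed]
```

With a guarantee like this, a test can assert that each round adds at least 
min_new centers regardless of how the random number generator is seeded.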



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)