[jira] [Commented] (MATH-1509) Implement the MiniBatchKMeansClusterer

Gilles Sadowski (Jira) Sun, 22 Mar 2020 08:08:46 -0700


    [ 
https://issues.apache.org/jira/browse/MATH-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17064306#comment-17064306
 ]


Gilles Sadowski commented on MATH-1509:
---------------------------------------

I've merged PR #128 (commit 4128dbb2060dd960ca633c9260fb260b237d983a in 
"master" branch).

I had to amend the log message: Please try and keep the existing formatting 
convention (see "git log").

The PR needed many code style fixes; please review my changes. If you agree 
with them, please try and follow those guidelines in subsequent PRs. Don't 
hesitate to ask if something is unclear.

I've removed some of the code comments (they are just visual noise whenever the 
code is self-documenting).
I've also removed the text reference to the Python library because, ideally, 
this "Commons" library should be self-contained for use by application 
developers: they should not have to perform a web search in order to find the 
justification for why the code is the way it is. It is certainly fine to 
provide more information, including web links, but these should appear as such 
in the Javadoc, using the HTML {{<a>}} tag or the Javadoc {{@see}} tag.
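
As an illustration of that last point, here is a minimal sketch of a Javadoc comment that carries its references as links. The class {{SquaredDistance}} is hypothetical (it is not part of the code base), and the two URLs are only examples of the kind of reference that could be linked.

```java
/**
 * Squared Euclidean distance helper (hypothetical class, for illustration only).
 *
 * <p>Background material is given as an explicit link rather than as a
 * passing mention in the text:
 * <a href="https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html">
 * scikit-learn's MiniBatchKMeans</a>.</p>
 *
 * @see <a href="https://en.wikipedia.org/wiki/K-means_clustering">K-means clustering (Wikipedia)</a>
 */
final class SquaredDistance {
    /** Utility class: no instances. */
    private SquaredDistance() {}

    /**
     * Computes the squared Euclidean distance between two points.
     *
     * @param a First point.
     * @param b Second point (same dimension as {@code a}).
     * @return the sum of the squared coordinate differences.
     * @throws IllegalArgumentException if the dimensions differ.
     */
    static double between(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Dimension mismatch");
        }
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            final double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }
}
```

With this layout, the links end up as clickable references in the generated API documentation.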

Please follow up with a new PR in order to:
 * fix the issues reported by "CheckStyle",
 * add the missing documentation to class {{MiniBatchImprovementEvaluator}} 
(*all* fields and methods must be documented),
 * add {{package-info.java}} files in packages {{evaluation}} and 
{{initialization}}.

The reports are generated by this command:
{noformat}
$ mvn clean package site
{noformat}
and you can view them under the generated site in directory {{target/site}}.

Note that you'll need to set the environment variable {{JAVA_HOME}} to point to 
a Java 8 JDK in order to avoid unit test failures about methods missing from 
{{FastMath}}.
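
Regarding the requirement above that all fields and methods be documented, here is a minimal sketch of what that looks like. The class {{NoImprovementCounter}} and its stopping rule are hypothetical; they are not the actual {{MiniBatchImprovementEvaluator}} code and only serve to show a Javadoc comment on every field, constructor and method.

```java
/**
 * Counts consecutive iterations that did not improve a cost value
 * (hypothetical class, for illustrating the documentation style only).
 */
final class NoImprovementCounter {
    /** Maximum number of consecutive iterations without improvement. */
    private final int maxNoImprovement;
    /** Lowest cost recorded so far. */
    private double bestCost = Double.POSITIVE_INFINITY;
    /** Current number of consecutive iterations without improvement. */
    private int noImprovementCount;

    /**
     * @param maxNoImprovement Maximum number of consecutive iterations
     * without improvement (must be strictly positive).
     * @throws IllegalArgumentException if {@code maxNoImprovement <= 0}.
     */
    NoImprovementCounter(int maxNoImprovement) {
        if (maxNoImprovement <= 0) {
            throw new IllegalArgumentException("Must be strictly positive: " + maxNoImprovement);
        }
        this.maxNoImprovement = maxNoImprovement;
    }

    /**
     * Records the cost of the latest iteration.
     *
     * @param cost Cost of the latest iteration.
     * @return {@code true} if the iterations should continue,
     * {@code false} if the limit of iterations without improvement is reached.
     */
    boolean record(double cost) {
        if (cost < bestCost) {
            bestCost = cost;
            noImprovementCount = 0;
        } else {
            ++noImprovementCount;
        }
        return noImprovementCount < maxNoImprovement;
    }

    /**
     * @return the lowest cost recorded so far.
     */
    double getBestCost() {
        return bestCost;
    }
}
```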

 

> Implement the MiniBatchKMeansClusterer
> --------------------------------------
>
>                 Key: MATH-1509
>                 URL: https://issues.apache.org/jira/browse/MATH-1509
>             Project: Commons Math
>          Issue Type: New Feature
>            Reporter: Chen Tao
>            Priority: Major
>         Attachments: compare.png, intensive-data-comparsion-badcase.png, 
> intensive-data-comparsion.png, random-data-comparison.png
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> MiniBatchKMeans is a fast clustering algorithm that uses a subset of the 
> points to initialize the cluster centers, and mini batches in the training 
> iterations.
> It can cluster millions of data points in a few seconds, with results that 
> differ little from those of KMeans.
> I have implemented it in Kotlin in my own project, and I'd like to contribute 
> the code to Apache Commons Math (in Java, of course).
> My implementation is based on Apache Commons Math3 and refers to Python's 
> sklearn.cluster.MiniBatchKMeans.
> Through testing, I found that it works well on dense data: performance 
> improves significantly and the results differ little from those of KMeans++. 
> On sparse data, however, the results differ a lot.
>  
> Below is the comparison of my implementation and KMeansPlusPlusClusterer:
>   !compare.png!
>  
> I have created a pull request on 
> [https://github.com/apache/commons-math/pull/117], for reference only.
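
For readers unfamiliar with the algorithm described in the quoted issue, the mini-batch update can be sketched as follows. This is an illustrative sketch of the general technique (sample a small batch, assign each sampled point to its nearest center, then move that center towards the point with a per-center rate of 1/count). It is not the code merged in the PR: the class name, the method signatures and the choice to have the caller supply the initial centers are all assumptions made for the example.

```java
import java.util.Random;

/** Minimal mini-batch k-means sketch (illustrative only, not the Commons Math code). */
final class MiniBatchKMeansSketch {
    /** Utility class: no instances. */
    private MiniBatchKMeansSketch() {}

    /**
     * @param points Data points.
     * @param initialCenters Initial centers (copied, not modified).
     * @param batchSize Number of points sampled per iteration.
     * @param iterations Number of iterations.
     * @param rng Source of randomness for sampling the batches.
     * @return the centers after the last iteration.
     */
    static double[][] cluster(double[][] points, double[][] initialCenters,
                              int batchSize, int iterations, Random rng) {
        final int k = initialCenters.length;
        final double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = initialCenters[c].clone();
        }
        final int[] counts = new int[k];           // points absorbed by each center so far
        final int[] batch = new int[batchSize];    // indices of the sampled points
        final int[] nearest = new int[batchSize];  // cached assignment of each sampled point

        for (int it = 0; it < iterations; it++) {
            // Sample a batch (with replacement) and cache the nearest center of each point.
            for (int b = 0; b < batchSize; b++) {
                batch[b] = rng.nextInt(points.length);
                nearest[b] = nearestCenter(points[batch[b]], centers);
            }
            // Move each assigned center towards its point; the rate shrinks as 1/count.
            for (int b = 0; b < batchSize; b++) {
                final double[] p = points[batch[b]];
                final double[] center = centers[nearest[b]];
                final double rate = 1.0 / ++counts[nearest[b]];
                for (int d = 0; d < center.length; d++) {
                    center[d] += rate * (p[d] - center[d]);
                }
            }
        }
        return centers;
    }

    /**
     * @param p Point.
     * @param centers Current centers.
     * @return the index of the center closest to {@code p} (squared Euclidean distance).
     */
    static int nearestCenter(double[] p, double[][] centers) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centers.length; c++) {
            double dist = 0;
            for (int d = 0; d < p.length; d++) {
                final double diff = p[d] - centers[c][d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }
}
```

Each iteration only touches {{batchSize}} points instead of the whole data set, which is where the speed-up over a full KMeans pass comes from.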



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
