Hi,

In my picture search project, I need a clustering algorithm to narrow down
the dataset, in order to speed up searches over millions of pictures.
  At first we used Python + PyTorch + KMeans. As the data grew from thousands
to millions of points, KMeans clustering became slower and slower (from
seconds to minutes). We then found that MiniBatchKMeans could finish the
clustering in 1~2 seconds on millions of points.
  Meanwhile, we still ran into Python's limited concurrency, so we switched
to Kotlin on the JVM.
  But there was no MiniBatchKMeans implementation for the JVM yet, so I wrote
one in Kotlin, based on the (Python) sklearn MiniBatchKMeans and Apache
Commons Math (Deeplearning4j was also considered, but it is too slow because
of ND4J's design).


  I'd like to contribute it to Apache Commons Math, and I have written a Java
version: https://github.com/chentao106/commons-math/tree/feature-MiniBatchKMeans


  In my tests (Kotlin version), it is very fast. In most cases its results
differ only slightly from those of KMeans++, but sometimes the difference is
large (possibly caused by the randomness of the mini-batches):




Some bad cases:

It gets even worse when I use RandomSource.create(RandomSource.MT_64, 0) as
the random generator ┐(´-`)┌.


My brief understanding of MiniBatchKMeans:
It uses a subset of the points to initialize the cluster centers, and a
random mini-batch in each training iteration.
It can finish in a few seconds when clustering millions of points, and its
results differ little from those of KMeans.
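
To make the description above concrete, here is a minimal Java sketch of the mini-batch update loop as I understand it from the Sculley paper. It is not the code from my branch. The class name MiniBatchKMeansSketch and the fit/nearestCenter methods are hypothetical, made up for this example. It also assumes plain random initialization to stay short, whereas sklearn defaults to k-means++ on a subsample.

```java
import java.util.Random;

/** Illustrative sketch only; names and signatures are hypothetical. */
final class MiniBatchKMeansSketch {

    /** Returns k centers fitted with per-center, count-based learning rates. */
    static double[][] fit(double[][] points, int k, int batchSize,
                          int iterations, long seed) {
        Random rng = new Random(seed);
        int dim = points[0].length;

        // Assumption: simple random initialization (duplicates are possible).
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = points[rng.nextInt(points.length)].clone();
        }

        int[] counts = new int[k];          // how many points each center has absorbed
        int[] batch = new int[batchSize];   // indices of the current mini-batch
        int[] nearest = new int[batchSize]; // cached nearest center per batch point

        for (int it = 0; it < iterations; it++) {
            // 1. Draw a random mini-batch (with replacement) and cache the
            //    nearest center of every point before any center moves.
            for (int i = 0; i < batchSize; i++) {
                batch[i] = rng.nextInt(points.length);
                nearest[i] = nearestCenter(points[batch[i]], centers);
            }
            // 2. Move each assigned center toward its point. The rate
            //    1/count makes the center a running mean of its points.
            for (int i = 0; i < batchSize; i++) {
                double[] x = points[batch[i]];
                int j = nearest[i];
                double eta = 1.0 / ++counts[j];
                for (int d = 0; d < dim; d++) {
                    centers[j][d] += eta * (x[d] - centers[j][d]);
                }
            }
        }
        return centers;
    }

    /** Index of the center with the smallest squared Euclidean distance to x. */
    static int nearestCenter(double[] x, double[][] centers) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centers.length; c++) {
            double dist = 0.0;
            for (int d = 0; d < x.length; d++) {
                double diff = x[d] - centers[c][d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }
}
```

Each iteration only touches batchSize points, so the cost per iteration does not depend on the size of the dataset. Because both the initialization and the batches are random, different seeds can give noticeably different centers, which may be part of what I see in the bad cases.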


More information about MiniBatchKMeans:
  https://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf
  https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html
