I've noticed (and occasionally been bothered by) this too, so I just decided
to fix it:
https://github.com/JuliaStats/Clustering.jl/pull/35
https://github.com/JuliaStats/Distances.jl/pull/9
It seems that kmeans was optimized for very high dimensions, but performed
poorly on low-dimensional data. We'll see what the reaction is.
Even with these changes, I have a sense one could do better still, but this is
at least a start.
Best,
--Tim
On Sunday, January 25, 2015 09:57:24 AM Martin Kapfhammer wrote:
> using DataFrames
> using Clustering
>
>
> raw_data = readtable("10000000x2s10.csv", header=false,
>                      eltypes=[Float64,Float64])
>
> matrix = transpose(array(raw_data))
>
> k = 3
>
> for i = 1:10
>     print("new round ")
>     println(i)
>
>     #@time measuring
>     @time result = kmeans(matrix, k)
>     print("totalcost ")
>     println(result.totalcost)
>     print("iterations ")
>     println(result.iterations)
>     print("converged ")
>     println(result.converged)
> end
>
>
> println("test run done")