Hi John,

Thanks for the tip, but actually I'm not using this function in production. I was reading "Programming Collective Intelligence" and trying to implement the examples in Julia rather than Python (with some complications such as missing packages, like Beautiful Soup, but that's OK...). So this is an exercise to help me understand these algorithms and learn more Julia at the same time.
The next book I'll try this with is, guess what, "Machine Learning for Hackers"! I hope translating the algorithms in that book is easier.

Thanks again!

On Thursday, July 3, 2014 at 19:29:37 UTC-3, John Myles White wrote:
>
> Hi Paulo,
>
> Rather than implement k-means from scratch, I'd encourage you to use the
> implementation in the Clustering.jl package.
>
>  -- John
>
> On Jul 3, 2014, at 2:51 PM, Paulo Castro wrote:
>
> Hi guys,
>
> I'm trying to implement the K-Means Clustering Algorithm, but I'm having
> some problems. The function I wrote:
>
> function kcluster(data; distance = pearson, k=4)
>     # Generate a list of tuples of the min and max values of each column of "data"
>     ranges = [(minimum(data[:,i]), maximum(data[:,i])) for i in 1:size(data,2)]
>
>     # Create k randomly placed centroids
>     centroids = [rand()*ranges[j][2] - ranges[j][1] + ranges[j][1] for i in 1:k, j in 1:length(ranges)]
>
>     lastmatches = Any[]
>     for t in 1:100
>         println("Iteration $t")
>         bestmatches = [Int[] for i in 1:k]
>
>         # Get best matches for each cluster
>         for j in 1:size(data, 1)
>             row = data[j, :]
>             bestmatch = 1
>             bestd = distance(centroids[bestmatch, :], row)
>
>             for i in 1:k
>                 d = distance(centroids[i, :], row)
>                 if d < bestd
>                     bestd = d
>                     bestmatch = i
>                 end
>             end
>
>             push!(bestmatches[bestmatch], j)
>         end
>
>         if lastmatches == bestmatches
>             return lastmatches
>         end
>
>         lastmatches = bestmatches
>
>         # Move clusters to the average of its matches
>         numcols = size(data, 2)
>         for i in 1:k
>             avgs = zeros(1, numcols)
>             if length(bestmatches[i]) > 0
>                 for row in bestmatches[i]
>                     avgs += data[row, :]
>                 end
>
>                 avgs /= length(bestmatches[i])
>                 centroids[i, :] = avgs
>             end
>         end
>     end
>
>     return lastmatches
> end
>
> The "data" argument is a two-dimensional Array, each row representing an
> individual, and each column its position in space.
>
> The problem is the following: the same algorithm in Python (with the same
> "data" input) usually stops near iteration #5, while in Julia it always
> runs to iteration #100. The non-empty clusters in Python are also smaller,
> so there are fewer empty clusters. Can somebody see why it never
> enters the "if lastmatches == bestmatches" block?
>
> Sorry about my poor English
>
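
A note on the centroid line in the quoted function: because `*` binds tighter than `-`, the expression `rand()*ranges[j][2] - ranges[j][1] + ranges[j][1]` reduces to `rand()*ranges[j][2]`, so the column minimum cancels out. Below is a minimal sketch of drawing each coordinate uniformly between the column minimum and maximum instead. It is only an illustration: `column_ranges` and `random_centroids` are made-up helper names, the code assumes a recent Julia version, and this may or may not be related to the loop never converging.

```julia
# Sketch only: `column_ranges` and `random_centroids` are hypothetical helper
# names, not part of the quoted code or of any package.

# (min, max) of every column of `data`, one tuple per column.
column_ranges(data) = [extrema(data[:, j]) for j in 1:size(data, 2)]

# Build a k-by-ncols matrix of centroids; entry (i, j) is drawn uniformly
# from the interval [lo, hi] spanned by column j of `data`.
function random_centroids(data, k)
    ranges = column_ranges(data)
    centroids = zeros(k, length(ranges))
    for j in 1:length(ranges)
        lo, hi = ranges[j]
        for i in 1:k
            # Parentheses matter: without them this is just rand() * hi.
            centroids[i, j] = lo + rand() * (hi - lo)
        end
    end
    return centroids
end
```

For example, with `data = [1.0 10.0; 3.0 20.0; 5.0 40.0]`, `random_centroids(data, 4)` gives a 4x2 matrix whose first column lies between 1.0 and 5.0 and whose second column lies between 10.0 and 40.0.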
