Hi,

Looks like a fun discussion; it's too bad I did not join it earlier. My first try at scipy-cluster was completely in Python. Like you, I also tried to find the most efficient way to transform the distance matrix when joining two clusters. Eventually my data sets became big enough that I decided to write these parts in C. I don't think my Python joining code was as efficient as yours.

I tried out your first test and I am a little confused by the output.

In [107]: goodman.test()
1
Clusters
[['BA'], ['FI'], ['MI', 'TO'], ['NA'], ['RM']]
Distance
[[997 662 255 412 996]
 [662 997 468 268 400]
 [255 468 997 219 869]
 [412 268 219 997 669]
 [996 400 869 669 997]]

2
Clusters
[['BA'], ['FI'], ['MI', 'TO', 'NA'], ['RM']]
Distance
[[998 662 412 996]
 [662 998 268 400]
 [412 268 998 669]
 [996 400 669 998]]

3
Clusters
[['BA'], ['FI', 'MI', 'TO', 'NA'], ['RM']]
Distance
[[999 412 996]
 [412 999 669]
 [996 669 999]]

4
Clusters
[['BA', 'FI', 'MI', 'TO', 'NA'], ['RM']]
Distance
[[1000  669]
 [ 669 1000]]

5
Clusters
[['BA', 'FI', 'MI', 'TO', 'NA', 'RM']]
Distance
[[1001]]

The first step is right: singletons 2 and 5 (indexing from 0) should be joined first since their distance, 138, is the minimum. Let's look at their corresponding rows in the distance matrix.

In [101]: DM[[2,5],:]
Out[101]:
array([[   877.,    295.,  10000.,    754.,    564.,    138.],
       [   996.,    400.,    138.,    869.,    669.,  10000.]])

These two rows, rows 2 and 5, are all we need to form the row for the newly joined cluster in the next distance matrix. If we just take the minimum of each column, we obtain

In [102]: q = DM[[2,5],:].min(axis=0); q
Out[102]: array([ 877.,  295.,  138.,  754.,  564.,  138.])

so the row for the new cluster should be the row above with the entries at indices 2 and 5 removed. Roughly speaking, the next distance matrix should contain a row with the following values, but I don't see one in your output.

In [103]: q[q != 138]
Out[103]: array([ 877.,  295.,  754.,  564.])

Since 295 is the minimum distance between this newly joined cluster and any other singleton, and singletons 3 and 4 are closer to one another with a distance of 219, the new cluster should not take part in the second iteration. So after iteration 2, you should get [['BA'], ['FI'], ['MI', 'TO'], ['NA', 'RM']].

Recall that the distance matrix transformation forms a new distance matrix using only values from the previous distance matrix. So, at any iteration, the values in the distance matrix should be a subset of the values in the original distance matrix, minus the entries involving the clusters already joined.
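
To make that concrete, here is a minimal, unoptimized sketch of one such transformation step on a full square matrix with a large sentinel on the diagonal (merge_once is just an illustrative name, not hcluster code; in practice you'd work on the triangle, as I describe below):

import numpy

def merge_once(DM, clusters, sentinel=10000.):
    """One single-linkage step: join the two closest clusters and
    rebuild the distance matrix using only existing values."""
    n = DM.shape[0]
    # smallest off-diagonal entry (the sentinel keeps the diagonal out)
    i, j = divmod(DM.argmin(), n)
    # column-wise minimum of the two merged rows
    row = DM[[i, j], :].min(axis=0)
    keep = [k for k in range(n) if k not in (i, j)]
    # drop rows/columns i and j, then append the new cluster's row/column
    newDM = numpy.empty((len(keep) + 1, len(keep) + 1))
    newDM[:-1, :-1] = DM[numpy.ix_(keep, keep)]
    newDM[-1, :-1] = newDM[:-1, -1] = row[keep]
    newDM[-1, -1] = sentinel
    newclusters = [clusters[k] for k in keep] + [clusters[i] + clusters[j]]
    return newDM, newclusters

Applying it twice to the matrix above gives [['BA'], ['FI'], ['MI', 'TO'], ['NA', 'RM']], which is the result I'd expect after iteration 2.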

If we look at the minimum distances in the original distance matrix in rank order, we have 138, 219, 255, 268, 295. Thus we might expect the minimum distance found at each iteration to be the next of these values, and in this case it is, but I don't have a mathematical proof that this holds in general. If I run your distance matrix through hcluster.single, I get the following linkage matrix. The third column is the distance between the clusters joined; the first two columns are the indices of the clusters joined (non-singletons have an index >= n).

array([[   2.,    5.,  138.,    2.],
       [   3.,    4.,  219.,    2.],
       [   0.,    7.,  255.,    3.],
       [   1.,    8.,  268.,    4.],
       [   6.,    9.,  295.,    6.]])
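
One quick way to check that claim on this example (assuming DM is the full square matrix with the 10000 diagonal sentinel, and Z is the linkage matrix above):

# the five smallest distinct values in the original matrix...
numpy.unique(DM)[:5]   # array([ 138.,  219.,  255.,  268.,  295.])
# ...match the merge distances recorded in the linkage matrix
Z[:, 2]                # array([ 138.,  219.,  255.,  268.,  295.])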

I've attached the dendrogram, since it is easier to interpret.

In [105]: lbls
Out[105]: ['BA', 'FI', 'MI', 'NA', 'RM', 'TO']
In [106]: hcluster.dendrogram(Z, labels=lbls)

I tried running your second test, and as you'll see, C gives a better speed-up (not surprising). Roughly, what I'm doing in C is storing only the upper triangle of the distance matrix. An array of double*'s (a double **) points to each row of this triangle. To eliminate a row, I simply remove its entry from the double ** array. To remove a column, I shift the values over in each remaining row. I'm not sure this is the best approach, but it is certainly more efficient than what can be achieved in pure Python.
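
If it helps to see the idea without the pointer arithmetic, here is a rough Python analogue of that layout (schematic only: make_triangle, remove_row, and remove_col are illustrative names, and the bookkeeping that keeps row indices consistent across merges is omitted):

import numpy

# tri[i] holds the distances from point i to points i+1 .. n-1,
# i.e. one row of the strict upper triangle
def make_triangle(DM):
    n = DM.shape[0]
    return [DM[i, i + 1:].copy() for i in range(n - 1)]

# deleting a row is just dropping one row pointer (O(1) in C)
def remove_row(tri, i):
    del tri[i]

# deleting column j shifts the tail of each earlier row left by one;
# row i only stores columns i+1 .. n-1, so rows i >= j never held column j
def remove_col(tri, j):
    for i in range(min(j, len(tri))):
        k = j - i - 1  # position of column j within row i
        tri[i] = numpy.delete(tri[i], k)

In the C version the shift happens in place with a per-row length counter, so no reallocation is needed.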

In [107]: hcluster.goodman.test2(1000)
n = 1000 took 22.10 seconds
In [108]: n=1000
In [109]: uu=numpy.random.rand(n*(n-1)/2)
In [110]: tic = time.time(); hcluster.single(uu); toc = time.time(); print toc-tic
4.57607889175

Damian

Keith Goodman wrote:
On Fri, May 2, 2008 at 7:25 PM, Robert Kern <[EMAIL PROTECTED]> wrote:
 Assuming x is contiguous and you can modify x in-place:


 In [1]: from numpy import *

 In [2]: def dist(x):
   ...:    x = x + 1e10 * eye(x.shape[0])
   ...:    i, j = where(x == x.min())
   ...:    return i[0], j[0]
   ...:

 In [3]: def symmdist(N):
   ...:     x = random.rand(N, N)
   ...:     x = x + x.T
   ...:     x.flat[::N+1] = 0
   ...:     return x
   ...:

 In [4]: symmdist(5)
 Out[4]:
 array([[ 0.        ,  0.87508654,  1.11691704,  0.80366071,  0.57966808],
        [ 0.87508654,  0.        ,  1.5521685 ,  1.74010886,  0.52156877],
        [ 1.11691704,  1.5521685 ,  0.        ,  1.22725396,  1.04101992],
        [ 0.80366071,  1.74010886,  1.22725396,  0.        ,  1.94577965],
        [ 0.57966808,  0.52156877,  1.04101992,  1.94577965,  0.        ]])

 In [5]: def kerndist(x):
   ...:     N = x.shape[0]
   ...:     x.flat[::N+1] = x.max()
   ...:     ij = argmin(x.flat)
   ...:     i, j = divmod(ij, N)
   ...:     return i, j
   ...:

 In [10]: x = symmdist(500)

 In [15]: %timeit dist(x)
 10 loops, best of 3: 19.9 ms per loop

 In [16]: %timeit kerndist(x)
 100 loops, best of 3: 4.38 ms per loop

I added

i, j = divmod(x.argmin(), x.shape[0])

to

http://scipy.org/PerformanceTips

<<inline: example.png>>
