Hi,

Looks like a fun discussion; it's too bad I did not join it earlier. My first try at scipy-cluster was completely in Python. Like you, I also tried to find the most efficient way to transform the distance matrix when joining two clusters. Eventually my data sets became big enough that I decided to write these parts in C. I don't think my Python joining code was as efficient as yours.

I tried out your first test and I am a little confused by the output.

In [107]: goodman.test()
1
Clusters
[['BA'], ['FI'], ['MI', 'TO'], ['NA'], ['RM']]
Distance
[[997 662 255 412 996]
 [662 997 468 268 400]
 [255 468 997 219 869]
 [412 268 219 997 669]
 [996 400 869 669 997]]

2
Clusters
[['BA'], ['FI'], ['MI', 'TO', 'NA'], ['RM']]
Distance
[[998 662 412 996]
 [662 998 268 400]
 [412 268 998 669]
 [996 400 669 998]]

3
Clusters
[['BA'], ['FI', 'MI', 'TO', 'NA'], ['RM']]
Distance
[[999 412 996]
 [412 999 669]
 [996 669 999]]

4
Clusters
[['BA', 'FI', 'MI', 'TO', 'NA'], ['RM']]
Distance
[[1000  669]
 [ 669 1000]]

5
Clusters
[['BA', 'FI', 'MI', 'TO', 'NA', 'RM']]
Distance
[[1001]]

The first step is right: singletons 2 and 5 (indexing from 0) should be joined first since their distance, 138, is the minimum. Let's look at their corresponding rows in the distance matrix.

In [101]: DM[[2,5],:]
Out[101]:
array([[   877.,    295.,  10000.,    754.,    564.,    138.],
       [   996.,    400.,    138.,    869.,    669.,  10000.]])

These two rows, rows 2 and 5, are all we need to form the row for the newly joined cluster in the next distance matrix. If we just take the minimum of each column, we obtain

In [102]: q = DM[[2,5],:].min(axis=0); q
Out[102]: array([ 877.,  295.,  138.,  754.,  564.,  138.])

so the row for the new cluster should be the row above with the entries at indices 2 and 5 removed. Roughly speaking, the next distance matrix should contain a row with the following values, but I don't see one in your output.

In [103]: q[q != 138]
Out[103]: array([ 877.,  295.,  754.,  564.])

Since 295 is the minimum distance between this newly joined cluster and any other singleton, and singletons 3 and 4 are closer to one another with a distance of 219, the new cluster should not take part in the second iteration. So after iteration 2, you should get [['BA'], ['FI'], ['MI', 'TO'], ['NA', 'RM']].

Recall that the distance matrix transformation forms a new distance matrix using only values from the previous distance matrix. So, at any iteration, the values in the distance matrix should be a subset of the values in the original distance matrix, minus the entries involving the clusters already joined.
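
To make that concrete, here is a minimal, unoptimized sketch of one such transformation step on a full square matrix with a large sentinel on the diagonal (merge_once is just an illustrative name, not hcluster code; in practice you'd work on the triangle, as I describe below):

import numpy

def merge_once(DM, clusters, sentinel=10000.):
    """One single-linkage step: join the two closest clusters and
    rebuild the distance matrix using only existing values."""
    n = DM.shape[0]
    # smallest off-diagonal entry (the sentinel keeps the diagonal out)
    i, j = divmod(DM.argmin(), n)
    # column-wise minimum of the two merged rows
    row = DM[[i, j], :].min(axis=0)
    keep = [k for k in range(n) if k not in (i, j)]
    # drop rows/columns i and j, then append the new cluster's row/column
    newDM = numpy.empty((len(keep) + 1, len(keep) + 1))
    newDM[:-1, :-1] = DM[numpy.ix_(keep, keep)]
    newDM[-1, :-1] = newDM[:-1, -1] = row[keep]
    newDM[-1, -1] = sentinel
    newclusters = [clusters[k] for k in keep] + [clusters[i] + clusters[j]]
    return newDM, newclusters

Applying it twice to the matrix above gives [['BA'], ['FI'], ['MI', 'TO'], ['NA', 'RM']], which is the result I'd expect after iteration 2.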

If we look at the minimum distances in the original distance matrix in rank order, we have 138, 219, 255, 268, 295. Thus we might expect the minimum distance found at each iteration to be the next of these values, and in this case it is, but I don't have a mathematical proof that this holds in general. If I run your distance matrix through hcluster.single, I get the following linkage matrix. The third column is the distance between the clusters joined; the first two columns are the indices of the clusters joined (non-singletons have an index >= n).

array([[   2.,    5.,  138.,    2.],
       [   3.,    4.,  219.,    2.],
       [   0.,    7.,  255.,    3.],
       [   1.,    8.,  268.,    4.],
       [   6.,    9.,  295.,    6.]])
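
One quick way to check that claim on this example (assuming DM is the full square matrix with the 10000 diagonal sentinel, and Z is the linkage matrix above):

# the five smallest distinct values in the original matrix...
numpy.unique(DM)[:5]   # array([ 138.,  219.,  255.,  268.,  295.])
# ...match the merge distances recorded in the linkage matrix
Z[:, 2]                # array([ 138.,  219.,  255.,  268.,  295.])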

I've attached the dendrogram, since it is easier to interpret.

In [105]: lbls
Out[105]: ['BA', 'FI', 'MI', 'NA', 'RM', 'TO']
In [106]: hcluster.dendrogram(Z, labels=lbls)

I tried running your second test, and as you'll see, C gives a better speed-up (not surprising). Roughly, what I'm doing in C is storing only the upper triangle of the distance matrix. An array of double*'s (a double **) points to each row of this triangle. To eliminate a row, I simply remove its entry from the double ** array. To remove a column, I shift the values over in each remaining row. I'm not sure this is the best approach, but it is certainly more efficient than what can be achieved in pure Python.
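
If it helps to see the idea without the pointer arithmetic, here is a rough Python analogue of that layout (schematic only: make_triangle, remove_row, and remove_col are illustrative names, and the bookkeeping that keeps row indices consistent across merges is omitted):

import numpy

# tri[i] holds the distances from point i to points i+1 .. n-1,
# i.e. one row of the strict upper triangle
def make_triangle(DM):
    n = DM.shape[0]
    return [DM[i, i + 1:].copy() for i in range(n - 1)]

# deleting a row is just dropping one row pointer (O(1) in C)
def remove_row(tri, i):
    del tri[i]

# deleting column j shifts the tail of each earlier row left by one;
# row i only stores columns i+1 .. n-1, so rows i >= j never held column j
def remove_col(tri, j):
    for i in range(min(j, len(tri))):
        k = j - i - 1  # position of column j within row i
        tri[i] = numpy.delete(tri[i], k)

In the C version the shift happens in place with a per-row length counter, so no reallocation is needed.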

In [107]: hcluster.goodman.test2(1000)
n = 1000 took 22.10 seconds
In [108]: n=1000
In [109]: uu=numpy.random.rand(n*(n-1)/2)
In [110]: tic = time.time(); hcluster.single(uu); toc = time.time(); print toc-tic
4.57607889175

Damian

Keith Goodman wrote:
On Fri, May 2, 2008 at 7:25 PM, Robert Kern <[EMAIL PROTECTED]> wrote:
 Assuming x is contiguous and you can modify x in-place:


 In [1]: from numpy import *

 In [2]: def dist(x):
   ...:    x = x + 1e10 * eye(x.shape[0])
   ...:    i, j = where(x == x.min())
   ...:    return i[0], j[0]
   ...:

 In [3]: def symmdist(N):
   ...:     x = random.rand(N, N)
   ...:     x = x + x.T
   ...:     x.flat[::N+1] = 0
   ...:     return x
   ...:

 In [4]: symmdist(5)
 Out[4]:
 array([[ 0.        ,  0.87508654,  1.11691704,  0.80366071,  0.57966808],
        [ 0.87508654,  0.        ,  1.5521685 ,  1.74010886,  0.52156877],
        [ 1.11691704,  1.5521685 ,  0.        ,  1.22725396,  1.04101992],
        [ 0.80366071,  1.74010886,  1.22725396,  0.        ,  1.94577965],
        [ 0.57966808,  0.52156877,  1.04101992,  1.94577965,  0.        ]])

 In [5]: def kerndist(x):
   ...:     N = x.shape[0]
   ...:     x.flat[::N+1] = x.max()
   ...:     ij = argmin(x.flat)
   ...:     i, j = divmod(ij, N)
   ...:     return i, j
   ...:

 In [10]: x = symmdist(500)

 In [15]: %timeit dist(x)
 10 loops, best of 3: 19.9 ms per loop

 In [16]: %timeit kerndist(x)
 100 loops, best of 3: 4.38 ms per loop

I added

i, j = divmod(x.argmin(), x.shape[0])

to

http://scipy.org/PerformanceTips

<<inline: example.png>>
