On 21-4-2007 1:42 Mark Kirkwood wrote:
> I don't think that will work for the vector norm, i.e.:
>
> |x - y| = sqrt(sum over j ((x[j] - y[j])^2))

I don't know if this is useful here, but I was able to rewrite that algorithm for a set of very sparse vectors (i.e. vectors with very few overlapping non-zero entries) to something like:
|x - y|^2 = sum over j (x[j]^2) + sum over j (y[j]^2)
            + for each j where x[j] and y[j] are both non-zero: (x[j] - y[j])^2 - x[j]^2 - y[j]^2

(note this gives the squared distance; take the sqrt at the end if you need |x - y| itself)

The first two partial sums need to be calculated only once per vector. So if there is very little overlap, this is much more efficient (if there is no overlap at all, you simply end up with sum of x[j]^2 + sum of y[j]^2 anyway). Moreover, this rewrite lets you store the X and Y vectors in a trivial table layout vector(x, j, value) that contains only the non-zeros, and which you can trivially self-join to find the closest matches. With this rewrite you don't care about the j's where either x[j] or y[j] is zero.
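To make the idea concrete, here is a small sketch of that decomposition in Python, assuming the sparse vectors are stored as {index: value} dicts holding only the non-zero entries (the names and storage format are illustrative, not from the original post):

```python
import math

def norm_sq(v):
    """Sum of squares of a sparse vector; computed once per vector and reused."""
    return sum(val * val for val in v.values())

def sparse_distance(x, y, x_sq, y_sq):
    """Euclidean distance using the precomputed per-vector sums.

    Start from sum(x^2) + sum(y^2) and correct only the coordinates
    where both vectors are non-zero, as in the rewrite above.
    """
    d_sq = x_sq + y_sq
    for j in x.keys() & y.keys():  # only the overlapping indices
        d_sq += (x[j] - y[j]) ** 2 - x[j] ** 2 - y[j] ** 2
    return math.sqrt(d_sq)

# Hypothetical example vectors, non-zeros only.
x = {0: 3.0, 2: 1.0, 5: 4.0}
y = {2: 2.0, 7: 6.0}
x_sq, y_sq = norm_sq(x), norm_sq(y)

# Cross-check against the dense definition of the distance.
dense = math.sqrt(sum((x.get(j, 0.0) - y.get(j, 0.0)) ** 2 for j in range(8)))
assert abs(sparse_distance(x, y, x_sq, y_sq) - dense) < 1e-9
```

The work per comparison is proportional to the overlap, not the vector length, which is where the speedup for sparse data comes from.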

On just a single 1.8 GHz AMD processor I can compare over 1000 y's of on average 100 elements each against two x's of over 1000 elements. (I use it for a bi-kmeans algorithm, so there are only two buckets to compare against.)

So it might be possible to rewrite your algorithm to be less calculation-intensive. Obviously this isn't going to work with a dense matrix, but there may be other ways to skip parts of the calculation or to cache large parts of it. It might also help to extract only the six relevant columns into a separate temporary table, which will have much smaller records and can therefore fit more records per page.

Best regards,

Arjen
