I too have been looking at this problem, and have a slightly different
spin that makes it marginally more tractable than Devon's approach.  I
still cannot do it, however.

As Devon has indicated, the immediate problem concerns a boolean
matrix m of rank 17770 408189.  There are about 1e8 nonzero entries,
and the matrix has a density of about 12%.

It would suffice to solve the problem if one could calculate m mp |: m
and (|:m) mp m, where mp=:+/ . * . However, the matrix is too big to
effectively calculate this, even with all sorts of programming tricks.

You do not really want to calculate the entire matrix product
Instead, you only want entries that correspond to highly correlated
customers or movies.  The approach I have been taking is based on the
Tanimoto coefficient, which is used in chemical analysis and DNA
matching.

If x and y are boolean vectors, define (at least in principle)

T=:4 : '(x mp y)%(x mp x)*(y mp y)'

A value close to 1 indicates x and y are highly correlated, a value
close to 0 indicates they are not.

Note that x mp x is just +/x. There are other optimizations if the
vectors are lists of indices of nonzero values: for example, then
x mp x is #x and x mp y is (#x)+(#y)-#~.x,y .

Suppose now I have a vector a and I want to test it against the rows
of m, finding only the rows b such that t<:a T b for some threshold
value t.  Easy analysis shows that you only have to consider b
satisfying

t*(a mp a) <: (b mp b) and (b mp b)<:(a mp a)%t

By choosing an appropriately high value of t, you can seriously limit
the rows b that have to be considered.  In addition, you can use these
inequalities to find a closest match, since you can let t be the
highest Tanimoto coefficient found so far.

If anyone has suggestions as to how to program this effectively in J,
I would be very interested.  What is stopping me at the moment is that
global choice of t gives either too many or too few matches: it needs
to be adapted to a in some way I cannot see at the moment.


Best wishes,

John

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

Reply via email to