I too have been looking at this problem, and have a slightly different spin that makes it marginally more tractable than Devon's approach. I still cannot do it, however.
As Devon has indicated, the immediate problem concerns a boolean matrix m of rank 17770 408189. There are about 1e8 nonzero entries, and the matrix has a density of about 12%. It would suffice to solve the problem if one could calculate m mp |: m and (|:m) mp m, where mp=:+/ . * . However, the matrix is too big to effectively calculate this, even with all sorts of programming tricks. You do not really want to calculate the entire matrix product Instead, you only want entries that correspond to highly correlated customers or movies. The approach I have been taking is based on the Tanimoto coefficient, which is used in chemical analysis and DNA matching. If x and y are boolean vectors, define (at least in principle) T=:4 : '(x mp y)%(x mp x)*(y mp y)' A value close to 1 indicates x and y are highly correlated, a value close to 0 indicates they are not. Note that x mp x is just +/x. There are other optimizations if the vectors are lists of indices of nonzero values: for example, then x mp x is #x and x mp y is (#x)+(#y)-#~.x,y . Suppose now I have a vector a and I want to test it against the rows of m, finding only the rows b such that t<:a T b for some threshold value t. Easy analysis shows that you only have to consider b satisfying t*(a mp a) <: (b mp b) and (b mp b)<:(a mp a)%t By choosing an appropriately high value of t, you can seriously limit the rows b that have to be considered. In addition, you can use these inequalities to find a closest match, since you can let t be the highest Tanimoto coefficient found so far. If anyone has suggestions as to how to program this effectively in J, I would be very interested. What is stopping me at the moment is that global choice of t gives either too many or too few matches: it needs to be adapted to a in some way I cannot see at the moment. Best wishes, John ---------------------------------------------------------------------- For information about J forums see http://www.jsoftware.com/forums.htm
