>>> inline
________________________________
From: Ted Dunning <[email protected]>
To: [email protected]
Sent: Wednesday, 24 June, 2009 7:37:38
Subject: Re: LSI, cosine and others which use vectors

On Tue, Jun 23, 2009 at 8:05 PM, Paul Jones <[email protected]> wrote:

> tks Ted, but if it's a live system, and you have 10 million documents,
> then isn't the computation on the fly going to be a pain, if you add say
> 1000 docs per hour or whatever, which is why I was assuming that it's a
> batch process.

That really depends on what you are doing. Mostly, computing on the fly
turns out vastly better. For instance, if you have 10^7 documents as you
say, then you have 10^14 distances to think about. In fact, most of those
distances are large enough that you can ignore them, so it is nice to have
a (notional/virtual) sparse similarity matrix instead. The cost of
computing the entire doc-doc matrix is substantial and more than you can
typically do on the fly, but you almost never need the entire similarity
matrix. So it is cheaper in many cases to just compute the part that you
need on the fly.

>>>> I don't follow how you would do just part of the matrix, i.e. if you
>>>> have created the term-doc matrix, surely the data would need to be
>>>> weighted across the whole set in order to give you relevance; I'm not
>>>> sure how you would slice out just the subset of the matrix you want to
>>>> work with.

There are exceptions, notably in recommendations, where it pays to compute
a large number of item-item similarities, sparsify the result and store it
as a binary vector.

[snip]

Indeed. If you look at your problem again, there are some interesting
connections between what you want to do and matrix algebra. For instance,
let's call your document x term matrix of counts A. Then the document-level
cooccurrence matrix (how often pairs of terms appear together in the same
document) is A' A. It often helps to use inverse-document-frequency scaled
values instead of counts here, or to post-process the raw counts to find
anomalously large ones.

The next step, however, is to not just find words that co-occur, but to
find words that have similar patterns of cooccurrence with each other. If
two words are interchangeable, it is reasonable to expect that they
wouldn't ever occur together, but you would expect that they would co-occur
with other words in just the same way. You can find words that have similar
cooccurrence patterns by comparing rows (or columns) of the cooccurrence
matrix with other rows of the same matrix. Using a matrix product does this
wholesale:

    (A' A)' (A' A) = A' A A' A

If you decompose A using singular values, you get A = U' s V, so
A' A = V' s^2 V and A' A A' A = V' s^4 V. What will happen here is that the
vectors with the largest few singular values will dominate, so you can
approximate this result by doing a term-level random walk:

    while (many repetitions) {
        Term t = random term
        d = random document containing t
        u1 = random word in d
        d = random document containing u1
        u2 = random word in d
        count([t, u2])
    }

Pairs that have lots of counts will be the related words you want. Note how
it is very easy to implement this using a doc x term matrix, especially one
implemented using something like Lucene.
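To make the pseudo-code concrete, here is a minimal sketch in plain Python.
It uses a tiny in-memory toy corpus rather than a Lucene index, and the
corpus, the docs_containing helper and the iteration count are illustrative
assumptions, not Mahout or Lucene code. It builds the doc x term count
matrix A, the cooccurrence product A' A, and then runs the sampling walk,
which picks up second-order neighbours without ever forming
(A' A)' (A' A) explicitly:

    import random
    from collections import Counter

    # Toy corpus standing in for an index; each document is a list of terms.
    docs = [
        "cats and dogs".split(),
        "dogs chase cats".split(),
        "felines and canines".split(),
        "canines chase felines".split(),
    ]

    terms = sorted({w for d in docs for w in d})

    # Doc x term count matrix A (dense only because the toy corpus is tiny).
    A = [[d.count(t) for t in terms] for d in docs]

    # Term cooccurrence matrix A'A: entry (i, j) counts terms i and j
    # appearing in the same document, weighted by within-document counts.
    cooc = [[sum(row[i] * row[j] for row in A) for j in range(len(terms))]
            for i in range(len(terms))]
    print("A'A row for 'cats':", dict(zip(terms, cooc[terms.index("cats")])))

    def docs_containing(term):
        # In a real system this would be an inverted-index lookup.
        return [d for d in docs if term in d]

    # The random walk: term -> doc -> word -> doc -> word, counting end pairs.
    pair_counts = Counter()
    for _ in range(100_000):
        t = random.choice(terms)
        d = random.choice(docs_containing(t))
        u1 = random.choice(d)
        d = random.choice(docs_containing(u1))
        u2 = random.choice(d)
        pair_counts[(t, u2)] += 1

    # Pairs with large counts include second-order neighbours such as
    # ("cats", "felines"), which never co-occur directly in this corpus.
    print(pair_counts.most_common(10))

On a real collection, docs_containing is just an inverted-index lookup,
which is exactly what a doc x term index like Lucene already provides.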
>>>> Okay, aside from being confused by the matrix algebra :-), I am
>>>> confused by the "easy to implement using a doc x term matrix" part,
>>>> i.e. I am not sure how a doc-term matrix would work out the similarity
>>>> between words; isn't it working out the occurrence of words in a doc?
>>>> Maybe I am misunderstanding... Let's say I have a matrix built, where
>>>> the docs are the columns and the words are the rows. My limited
>>>> understanding from what I have read says that this matrix can be
>>>> represented as a number of vectors, e.g. let's say we have one document
>>>> with 3 words, then the x/y/z axes represent each word and its frequency
>>>> of occurrence, and hence the point in space forming the vector depicts
>>>> that word related to that document. And this can be expanded. Now if we
>>>> have 2 documents, with 2 more words, we have another point. The
>>>> distance between them shows how similar they are, and hence how similar
>>>> the documents are to each other. So far so good, but I am unsure how
>>>> this translates into showing how similar the words are themselves,
>>>> i.e. co-occurrence; would that not have to be a term-term matrix?

[snip]

Tks again,
Paul
