Hi James,

I can't speak for anyone else, but in my experience the general approach has two steps. First, select a subset of documents based on the angle between the query vector and each document vector, both in their non-reduced forms (in vector notation, this is just a normal keyword search, which is what Lucene does by default). Then fetch the reduced term vectors for the documents in that subset and compare their angles to the reduced query vector.

If you skip the first step, you have to compute one dot product (query vector against document vector) for every document in your database, although you then only need to store the reduced term vectors. That is a lot of computation, but it is necessary if you want to match documents that are related to a query but do not contain some or any of its words. In my experience, that is a cool feature, but the hits it returns are usually pretty poor. If a document doesn't get a hit on a normal keyword search, just leave it out (note that this is only my opinion).

Some terminology, in case you did not follow: "reduced" refers to the projection of a vector onto a smaller subspace. You can normally reduce the dimension (column space) of the term-document matrix by ~60% with virtually no loss of precision in your searches. See "singular value decomposition" for the details.
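To make the two steps concrete, here is a minimal sketch in plain Python. Everything in it is made up for illustration: the four-term vocabulary, the four documents, and the function names are hypothetical, and `basis` is a hand-picked 2-row projection that merely stands in for what a truncated singular value decomposition would give you. It is not how Lucene does it internally.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0

def project(basis, vec):
    """Project a full term vector onto the reduced subspace (one value per basis row)."""
    return [dot(row, vec) for row in basis]

def search(query, docs, basis, reduced_docs, prefilter=True, top_n=10):
    """Rank documents by cosine similarity in the reduced space.

    With prefilter=True, only documents that share at least one term with
    the query (non-zero cosine in the full space) are considered.
    With prefilter=False, every document is scored.
    """
    if prefilter:
        candidates = [i for i, d in enumerate(docs) if cosine(query, d) > 0.0]
    else:
        candidates = list(range(len(docs)))
    q_red = project(basis, query)
    scored = [(cosine(q_red, reduced_docs[i]), i) for i in candidates]
    scored.sort(key=lambda s: (-s[0], s[1]))  # best score first, ties by index
    return [i for _, i in scored[:top_n]]

# Hypothetical vocabulary: [car, auto, engine, flower]
docs = [
    [1, 0, 1, 0],  # doc 0: car engine
    [0, 1, 1, 0],  # doc 1: auto engine
    [0, 0, 0, 1],  # doc 2: flower
    [1, 0, 0, 1],  # doc 3: car flower
]

# Hand-picked orthonormal rows standing in for an SVD-derived projection:
# row 0 lumps the vehicle terms together, row 1 is the flower term.
s = 1 / math.sqrt(3)
basis = [
    [s, s, s, 0],
    [0, 0, 0, 1],
]

# Reduced vectors are computed once and stored, not recomputed per query.
reduced_docs = [project(basis, d) for d in docs]

query = [1, 0, 0, 0]  # "car"
two_step_hits = search(query, docs, basis, reduced_docs, prefilter=True)
all_docs_hits = search(query, docs, basis, reduced_docs, prefilter=False)
print(two_step_hits)
print(all_docs_hits)
```

With the prefilter on, only documents 0 and 3 (the ones containing "car") are ranked. With it off, document 1 ("auto engine") also shows up near the top even though it never mentions "car", which is the feature I described, at the cost of scoring every document.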
Hope that helps,
Fredrik

On 4/20/06, James <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> We are implementing term vectors, and there is something about which I am
> unclear: Can term vectors be used to perform a search in its entirety
> (e.g., rank all 1 million documents in a database order, and then return the
> top 100), or, due to computational time requirements, are term vectors only
> intended to be a ranking method for a small subset of data that is the
> result of a Boolean search (e.g., we know the 100 documents that possible
> answers, now put them in relevancy order)?
>
> Thanks,
> James
