On May 2, 2006, at 1:55 PM, [EMAIL PROTECTED] wrote:

This is an issue of scaling the different dimensions.

It is more expensive to calculate similarity based on the entire document's contents rather than just a snippet chosen by the Highlighter. However, it's presumably more accurate, and having the Term Vectors pre-built at index time should help quite a bit.

This varies, actually, depending on the document.  If
you grab HTML from a portal, and use it all, pages from
that portal will tend to cluster together.  If you just
use snippets of text around document passages that
match your query, you can actually get more accurate clustering relative
to your query.  It really depends on whether the documents
are single-topic and coherent.  If so, use them all; if not,
use snippets.  [You can see this problem leading the
Google news classifier astray on occasion.]

That's both helpful and deflating. :\ I can imagine that if you used the complete document vector from an HTML document that included navigation text, the navigation text would drive the clustering. That navigation text, which cannot practically be expunged at spidering/indexing time if you are naive about the document structure, is unlikely to show up in a snippet.
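
To make the navigation-text effect concrete, here is a minimal sketch using plain term-frequency vectors and cosine similarity. The page text, the `NAV` string, and the whitespace tokenizer are all made up for illustration; nothing here is Lucene's own API or scoring.

```python
import math
from collections import Counter

def term_vector(text):
    """Lowercased, whitespace-tokenized term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical portal chrome that appears on every page of the site.
NAV = "home news sports weather finance mail login help"

# Hypothetical query-relevant snippets from two unrelated pages.
snippet_a = "lucene term vectors stored at index time"
snippet_b = "recipe for slow roasted tomato soup"

# Whole-page text = shared navigation + the page's real content.
page_a = NAV + " " + snippet_a
page_b = NAV + " " + snippet_b

full_sim = cosine(term_vector(page_a), term_vector(page_b))
snippet_sim = cosine(term_vector(snippet_a), term_vector(snippet_b))

print("whole pages:", round(full_sim, 3))
print("snippets:   ", round(snippet_sim, 3))
```

With these toy inputs the two pages share no content terms, so any similarity between the whole-page vectors comes entirely from the shared navigation words, while the snippet vectors do not overlap at all.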

A typical way to approximate is by only taking high TF/IDF
terms.

Another strike against using the existing Term Vectors, as you'd have to look them all up in the term dictionary. A stoplist could narrow things down some, but it would have to be applied at index-time if the terms were stemmed.
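
As a rough sketch of what "only taking high TF/IDF terms" could look like, the following keeps the k highest-weighted terms per document. It assumes raw term frequency times log(N/df), which is one common variant and not necessarily the weighting Lucene applies; the `doc_freq` counter stands in for the term dictionary lookup mentioned above, and the function name and sample documents are hypothetical.

```python
import math
from collections import Counter

def top_tfidf_terms(docs, k):
    """Return, for each document, the k terms with the highest TF-IDF weight."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    # In a real index this is the per-term lookup in the term dictionary.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    result = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumed weighting: raw tf * log(N / df).
        weights = {t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf}
        # Highest weight first; ties broken alphabetically for determinism.
        ranked = sorted(weights, key=lambda t: (-weights[t], t))
        result.append(ranked[:k])
    return result

docs = [
    "the index stores term vectors for the index reader",
    "the highlighter picks the best snippet",
    "the clustering groups the similar documents",
]
top = top_tfidf_terms(docs, 2)
for terms in top:
    print(terms)
```

A term that occurs in every document (here "the") gets weight zero and drops out without a stoplist, but only at the cost of one document-frequency lookup per distinct term.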

Principal component methods are also popular (e.g.
latent semantic indexing) to reduce dimensionality (usually
with a least-squares fit criterion).

I imagine that reducing dimensionality isn't necessary if you're using only snippets. And if you were to pre-compute LSI or similar at index-time, wouldn't you run into the same problems if your docs aren't single-topic and coherent to begin with?
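
For reference, the principal-component idea can be sketched at rank 1: find the single direction in term space that best fits the document vectors in the least-squares sense, then describe each document by its coordinate along it. Real LSI would use a truncated SVD with many more dimensions and a linear algebra library; the pure-Python power iteration below is a stand-in, and the tiny matrix and term list are invented for illustration.

```python
import math

def top_singular_direction(matrix, iterations=100):
    """Power iteration on A^T A.

    Approximates the first right singular vector of A (rows = documents,
    columns = terms), i.e. the direction of the best rank-1 least-squares fit.
    """
    n_terms = len(matrix[0])
    v = [1.0 / math.sqrt(n_terms)] * n_terms  # deterministic start vector
    for _ in range(iterations):
        # scores = A v (one value per document)
        scores = [sum(a * b for a, b in zip(row, v)) for row in matrix]
        # w = A^T scores (one value per term)
        w = [sum(row[j] * s for row, s in zip(matrix, scores))
             for j in range(n_terms)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v

def project(matrix, direction):
    """Coordinate of each document along the given direction."""
    return [sum(a * b for a, b in zip(row, direction)) for row in matrix]

# Hypothetical term counts; columns follow TERMS, one row per document.
TERMS = ["index", "term", "vector", "soup"]
A = [
    [2.0, 1.0, 1.0, 0.0],
    [1.0, 2.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

direction = top_singular_direction(A)
coords = project(A, direction)
print([round(c, 3) for c in coords])
```

In this toy case the first two documents land at the same coordinate and the third lands at essentially zero, because the single retained dimension is dominated by the majority vocabulary.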

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
