At 1:21 PM -0400 7/12/01, Geoff Hutchison wrote:
>On Thu, 12 Jul 2001, Gilles Detillieux wrote:
>
>> According to Elizabeth Taylor:
>> > I am trying to find out if Ht://Dig uses reverse indexing and boolean
>> > logic to obtain search results or if it also uses vector space
>> > retrieval?
>
>Hrm. I guess the reply I wrote didn't manage to get to the mailing
>list? I love e-mail. <rolls eyes>
>
>No one really does vector space indexing because it's just too
>inefficient, either in terms of space required for the index or for
>indexing (or both). For those of you curious, vector-space indexing
>basically means you'd have a vector of the words (or word ids) in a given
>document. So you can take the "distance" between a given query and each
>document.
Actually, vector indexing is really good for document similarity
matching (compare vectors) and is often used for information
filtering and automatic classification. It does require more
resources, but disks and RAM are so much cheaper than they used to
be...
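To make "compare vectors" concrete, here is a minimal sketch in
Python. It assumes plain word counts as the vector and cosine
similarity as the comparison; the function names and the three
sample texts are invented for illustration, and real systems
usually add term weighting and stemming on top of this.

```python
import math
from collections import Counter

def term_vector(text):
    """Map a text to a sparse vector of lowercase word counts."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_a * norm_b)

# Invented sample documents, not real data.
doc_a = term_vector("tennis match ends in a tie break")
doc_b = term_vector("the tennis final was a long match")
doc_c = term_vector("wheat prices rise after a dry summer")

# The two tennis stories should score closer to each other
# than either does to the agriculture story.
print(cosine_similarity(doc_a, doc_b))
print(cosine_similarity(doc_a, doc_c))
```

The vector per document is the extra storage mentioned above: every
document carries its own list of words and counts.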
>> index that tells htsearch essentially which documents contain a given word.
>> I believe this is known as a reverse index, although the term doesn't
>
>It's also called an "inverted index," for the reason that you've turned
>the text on its head--from pages of words to words pointing to
>pages/documents.
Right, "inverted index" is the common term. Many search engines
store this in disk files and do their own memory management, but I
think ht://Dig uses MySQL for this. Am I right?
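For anyone who wants to see the idea, a rough sketch of building an
inverted index in Python follows. This is only an in-memory
illustration; it says nothing about how ht://Dig actually stores
its word database, and the document ids and texts are made up.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Turn {doc_id: text} into {word: set of doc_ids containing it}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

# Hypothetical pages; the ids are arbitrary.
docs = {
    "page1": "search engines build an index",
    "page2": "an inverted index maps words to pages",
    "page3": "vector search compares documents",
}

index = build_inverted_index(docs)

# Looking up a word gives the documents that contain it.
print(sorted(index["index"]))
print(sorted(index["search"]))
```

The documents go in as "pages of words" and come out as "words
pointing to pages", which is the inversion described above.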
>In any case, almost every search engine that I know of uses an inverted
>index for the word database and then constructs some form of boolean
>query as Gilles mentioned. However, it's not strictly a boolean query in
>the traditional information retrieval sense, because search engines do
>rankings once they limit the results of the query, while a search at your
>local library probably just gives you all the matches in, say,
>alphabetical or chronological order.
Right, strict "Boolean" information retrieval means making a yes-no
decision about whether to include an item in the search results,
based on matches to the query as expressed with Boolean operators
(AND, OR, AND NOT). Relevance ranking is a different issue.
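As a small illustration of the yes-no decision, the three operators
map directly onto set operations over the document lists from an
inverted index. The posting lists below are invented, and
docs_with is just a helper name for this sketch.

```python
# Hypothetical posting lists: word -> ids of documents containing it.
postings = {
    "tennis": {1, 2, 5},
    "final": {2, 3},
    "doubles": {5},
}

def docs_with(word):
    """Return the set of document ids for a word (empty if unknown)."""
    return postings.get(word, set())

both = docs_with("tennis") & docs_with("final")        # tennis AND final
either = docs_with("tennis") | docs_with("final")      # tennis OR final
without = docs_with("tennis") - docs_with("doubles")   # tennis AND NOT doubles

print(sorted(both))
print(sorted(either))
print(sorted(without))
```

Each document is either in the result set or not; nothing here says
which match is better, which is why ranking is a separate step.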
Vector search engines allow for shades of grey. I just did a project
on filtering newspaper stories. It's easy to tell if a story is
about tennis or agriculture or a particular city, much harder to
define a query that identifies stories about "home and garden" or
"local news". You end up having to build long complex queries based
on existing stories, with extra weight for certain words. Also, long
queries are cheaper to perform when using vector comparisons rather
than doing a bunch of lookups and then finding the right intersection.
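Here is a rough sketch of that kind of long weighted query in
Python. It is not the code from the newspaper project; the sample
stories, the boost values, and the names build_profile and score
are all made up for illustration, and it assumes simple word counts
compared by cosine.

```python
import math
from collections import Counter

def build_profile(example_stories, boosts):
    """Sum word counts over example stories, then boost chosen words."""
    profile = Counter()
    for story in example_stories:
        profile.update(story.lower().split())
    for word, factor in boosts.items():
        if word in profile:
            profile[word] *= factor
    return profile

def score(profile, story):
    """Cosine similarity between the weighted profile and a new story."""
    words = Counter(story.lower().split())
    dot = sum(weight * words[w] for w, weight in profile.items() if w in words)
    norm_p = math.sqrt(sum(v * v for v in profile.values()))
    norm_s = math.sqrt(sum(v * v for v in words.values()))
    if norm_p == 0 or norm_s == 0:
        return 0.0
    return dot / (norm_p * norm_s)

# Invented "home and garden" examples with extra weight on "garden".
examples = [
    "prune roses in the garden this spring",
    "paint the kitchen and plant a herb garden",
]
profile = build_profile(examples, {"garden": 3})

print(score(profile, "a new garden shed for the spring"))
print(score(profile, "city council votes on the budget"))
```

The score is a shade of grey rather than a yes or no, so the filter
needs a cutoff chosen by looking at real stories.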
So it's a question of the right tool for the right purpose. Word
matching engines such as ht://Dig are definitely right for web sites
where the queries are so short.
Avi
--
_________________________________________________
Complete Guide to Search Engines for Web Sites, Intranets,
and Portals: <http://www.searchtools.com>