Re: [htdig] basis of search

nets Thu, 12 Jul 2001 13:01:18 -0700
At 1:21 PM -0400 7/12/01, Geoff Hutchison wrote:
>On Thu, 12 Jul 2001, Gilles Detillieux wrote:
>
>>  According to Elizabeth Taylor:
>>  > I am trying to find out if Ht://Dig uses reverse indexing and boolean
>>  > logic to obtain search results or if it also uses vector space
>>  > retrieval?
>
>Hrm. I guess the reply I wrote didn't manage to get to the mailing
>list? I love e-mail. <rolls eyes>
>
>No one really does vector space indexing because it's just too
>inefficient, either in terms of space required for the index or for
>indexing (or both). For those of you curious, vector-space indexing
>basically means you'd have a vector of the words (or word ids) in a given
>document. So you can take the "distance" between a given query and each
>document.

Actually, vector indexing is really good for document similarity 
matching (compare vectors) and is often used for information 
filtering and automatic classification.  It does require more 
resources but disks and RAM are so much cheaper than they used to 
be...
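
To make "compare vectors" concrete, here is a minimal sketch of cosine similarity over word-count vectors. The function names are my own for illustration, and the whitespace tokenizer is an assumption; a real engine would add stemming, stop words and term weighting.

```python
import math
from collections import Counter

def term_vector(text):
    """Map a text to a sparse vector of lowercase word counts."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (0.0 to 1.0)."""
    # Only words present in both vectors contribute to the dot product.
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)
```

Two documents with the same word distribution score close to 1.0, and documents with no words in common score 0.0.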

>  > index that tells htsearch essentially which document contain a given word.
>>  I believe this is known as a reverse index, although the term doesn't
>
>It's also called an "inverted index," for the reason that you've turned
>the text on its head--from pages of words to words pointing to
>pages/documents.

Right, "inverted index" is the common term.  Many search engines 
store this in disk files and do their own memory management, but I 
think ht://Dig uses MySQL for this, am I right?
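
For anyone following along, the "words pointing to documents" structure can be sketched in a few lines. This is only an in-memory illustration with a hypothetical function name and a naive whitespace tokenizer; it says nothing about how ht://Dig actually stores its word database.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Turn {doc_id: text} into {word: set of doc_ids containing it}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index
```

Looking up a word is then a single dictionary access that returns the set of documents containing it.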

>In any case, almost every search engine that I know of uses an inverted
>index for the word database and then constructs some form of boolean
>query as Gilles mentioned. However, it's not strictly a boolean query in
>the traditional information retrieval sense, because search engines do
>rankings once they limit the results of the query, while a search at your
>local library probably just gives you all the matches in, say,
>alphabetical or chronological order.

Right, strict "Boolean" information retrieval means making a yes-no 
decision about whether to include an item in the search results, 
based on matches to the query as expressed with Boolean operators 
(AND, OR, AND NOT).  Relevance ranking is a different issue.
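
As a rough sketch of that yes-no decision, Boolean operators map directly onto set operations over an inverted index. The function and parameter names here are hypothetical, and the index is assumed to be a plain dict from word to a set of document ids.

```python
def boolean_search(index, all_docs, must=(), should=(), must_not=()):
    """Return ids of documents that satisfy the query; no ranking is done."""
    results = set(all_docs)
    for word in must:                      # AND: keep docs containing every word
        results &= index.get(word, set())
    if should:                             # OR: keep docs containing at least one
        any_match = set()
        for word in should:
            any_match |= index.get(word, set())
        results &= any_match
    for word in must_not:                  # AND NOT: drop docs containing the word
        results -= index.get(word, set())
    return results
```

Every document is either in the result set or not; ordering the matches is a separate step.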

Vector search engines allow for shades of grey.  I just did a project 
on filtering newspaper stories.  It's easy to tell if a story is 
about tennis or agriculture or a particular city, much harder to 
define a query that identifies stories about "home and garden" or 
"local news".   You end up having to build long complex queries based 
on existing stories with extra weight for certain words.  Also, long 
queries are cheaper to perform when using vector comparisons rather 
than doing a bunch of lookups and then finding the right intersection.
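
Here is a simplified sketch of that kind of long weighted query, not the code from my project. The names, the boost factors and the unnormalized dot-product score are all assumptions for illustration; in practice you would normalize for story length and pick a cutoff score by testing.

```python
from collections import Counter

def build_profile(example_stories, boosts=None):
    """Sum word counts over example stories, then boost chosen words."""
    profile = Counter()
    for story in example_stories:
        profile.update(story.lower().split())
    for word, factor in (boosts or {}).items():
        if word in profile:
            profile[word] *= factor        # extra weight for telling words
    return profile

def score(profile, story):
    """Unnormalized dot product between the profile and a new story."""
    words = Counter(story.lower().split())
    return sum(weight * words[w] for w, weight in profile.items() if w in words)
```

A story is accepted into the category when its score passes the chosen cutoff, which gives the shades of grey that a strict Boolean match cannot.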

So it's a question of the right tool for the right purpose.  Word-matching 
engines such as ht://Dig are definitely right for web sites 
where the queries are so short.

Avi
-- 
_________________________________________________
Complete Guide to Search Engines for Web Sites, Intranets, 
   and Portals: <http://www.searchtools.com>

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a 
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html