Hi Jean-Marc,

It sounds like what you want to do is eliminate stop words (noise
words) and build a word index that records where each word occurs, so
that you can do proximity searches.
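
To make that concrete, here is a minimal sketch in Python of a
positional index with a simple proximity test. The stop list, the
function names and the sample documents are all made up for
illustration, and this is not how any particular engine does it.
Positions are counted before stop words are dropped, so distances
still reflect the original text.

```python
import re
from collections import defaultdict

# Hypothetical, tiny stop list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "on"}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Return word -> {doc_id: [positions]}, skipping stop words."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            if word not in STOP_WORDS:
                index[word][doc_id].append(pos)
    return index

def near(index, w1, w2, max_dist):
    """Return ids of documents where w1 and w2 occur within max_dist words."""
    p1 = index.get(w1, {})
    p2 = index.get(w2, {})
    hits = []
    for doc_id in p1.keys() & p2.keys():
        # Brute-force comparison of all position pairs; fine for a sketch.
        if any(abs(a - b) <= max_dist for a in p1[doc_id] for b in p2[doc_id]):
            hits.append(doc_id)
    return sorted(hits)

docs = {
    1: "The spider crawls the web",
    2: "A web page about a very large hairy spider",
}
index = build_index(docs)
print(near(index, "spider", "web", 3))
print(near(index, "spider", "web", 10))
```

With these two sample documents, the first query should match only
document 1 and the second should match both.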

You could start by taking a look at a paper describing the architecture
of Google, at http://citeseer.nj.nec.com/brin98anatomy.html. It includes
some of the basics, even a synopsis of the data structure they used.
More details can be found, e.g., in "How a Search Engine Works" at
http://www.infotoday.com/searcher/may01/liddy.htm, or by searching the
web for "information retrieval", "stop words", "text indexing", or
"proximity search" at, of course, Google.

In addition, you might want to consider implementing something akin to
the Porter stemming algorithm
(http://www.tartarus.org/~martin/PorterStemmer/), which strips common
suffixes from words in a series of rule-based steps. This reduces the
size of the word index and often improves recall, e.g. searching for
"connect" also finds the related words "connected", "connecting",
"connection", etc.
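
Just to illustrate the idea of stemming, here is a crude
suffix-stripping sketch in Python. It is NOT the Porter algorithm,
which uses several ordered rule steps with conditions on the remaining
stem; the suffix list and the minimum stem length below are
assumptions picked for the example.

```python
# Hypothetical suffix list, ordered so longer suffixes are tried first.
SUFFIXES = ("ions", "ion", "ing", "ed", "s")

def crude_stem(word, min_stem=3):
    """Strip the first matching suffix if at least min_stem letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

for w in ("connect", "connected", "connecting", "connection", "connections"):
    print(w, "->", crude_stem(w))
```

All five words above reduce to "connect", so they would share a single
index entry. The minimum stem length keeps short words such as "sing"
or "bed" from being mangled.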

There's also plenty of open source code out there that you can take a
look at and learn from. Some names I've heard are Lucene, Swish-E,
Glimpse, libibex, freeWAIS, iSearch, htDig, namazu... However, I don't
know any details regarding these, so just take a peek at one that uses
your choice of implementation language.

Hope this gets you started. I'm working on similar lines myself, and
will be glad to exchange more information.

Best wishes,

Arto


--
This message was sent by the Internet robots and spiders discussion list.