Hi Jean-Marc,
Sounds like what you want to do is eliminate stop words (noise words) and create a word index that will let you do proximity searches.

You could start by taking a look at the paper describing the architecture of Google, at http://citeseer.nj.nec.com/brin98anatomy.html. It covers some of the basics and even includes a synopsis of the data structures they used. More details can be found in, for example, "How a Search Engine Works" at http://www.infotoday.com/searcher/may01/liddy.htm, or by searching the web (at Google, of course) for "information retrieval", "stop words", "text indexing" and "proximity search".

In addition, you might want to consider implementing something akin to the Porter stemming algorithm (http://www.tartarus.org/~martin/PorterStemmer/), which strips common suffixes from words in a series of rule-based steps. This reduces the size of the word index and often improves recall: for example, searching for "connect" also finds the related words "connected", "connecting", "connection", etc.

There's also plenty of open source code out there that you can take a look at and learn from. Some names I've heard are Lucene, Swish-E, Glimpse, libibex, freeWAIS, iSearch, htDig, namazu... However, I don't know any details about these, so just take a peek at one that is written in your implementation language of choice.

Hope this gets you started. I'm working along similar lines myself, and will be glad to exchange more information.

Best wishes,
Arto

--
This message was sent by the Internet robots and spiders discussion list ([EMAIL PROTECTED]).
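
P.S. In case a concrete starting point helps, here is a rough sketch in Python of the stop-word filtering plus positional word index idea, with a naive proximity check on top. It is only an illustration under my own assumptions: the names STOP_WORDS, tokenize, build_index and near are made up for this example and do not come from the Google paper or from any of the packages mentioned above, the stop list is a toy one, and no stemming is done.

```python
import re
from collections import defaultdict

# Toy stop list, purely illustrative; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "on"}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each non-stop word to {doc_id: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            if word in STOP_WORDS:
                # Skipped words still consume a position, so the
                # distances between indexed words stay accurate.
                continue
            index[word].setdefault(doc_id, []).append(pos)
    return index

def near(index, w1, w2, max_dist):
    """Return sorted ids of documents where w1 and w2 occur
    within max_dist word positions of each other."""
    p1, p2 = index.get(w1, {}), index.get(w2, {})
    hits = []
    for doc_id in p1.keys() & p2.keys():
        # Brute-force comparison of all position pairs; fine for a sketch.
        if any(abs(a - b) <= max_dist for a in p1[doc_id] for b in p2[doc_id]):
            hits.append(doc_id)
    return sorted(hits)

docs = {
    1: "The spider crawls the web and builds an index of pages",
    2: "An index is useful; much later in the text a spider appears",
}
index = build_index(docs)
print(near(index, "spider", "index", 8))   # only document 1 is close enough
print(near(index, "spider", "index", 10))  # both documents qualify
```

Keeping the positions of the skipped stop words is a deliberate choice here: it makes "within N words" mean N words of the original text. If you add stemming, it would go into tokenize so that the index and the queries see the same word forms.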
