Hi. I've just finished my master's thesis on how to improve phrase searching in today's search engines. The thesis experiments with a new approach based on pairs of words (bigrams). It can be freely downloaded here [1].
What I've specifically experimented with is bigrams built around stopwords and their characteristics. For the experiment I created an Analyzer that produces bigram Tokens, each composed of a pair of words. We start from a predefined list of stopwords and then examine each token in the Analyzer. When a stopword token is identified, we create two new bigram tokens:

1) previous token + stopword token
2) stopword token + next token

The identified stopword token itself is discarded, since it would produce a huge posting list in the inverted index. The main goal is to drastically reduce the lengths of the posting lists, and thereby save I/O and processing in Apache Lucene. In the experiments I performed, this phrase searching approach gave some performance gains in Lucene.

The code created for the experiment will be made available shortly. I just need to write some Javadoc and tidy it up a bit. There is nothing revolutionary in the code, and I've noticed on this mailing list that others have also looked into this subject.

I hope someone finds some of the aspects discussed in my master's thesis useful. I've also, to some extent, tried to describe Apache Lucene and how it works.

[1] http://asbjorn.fellinghaug.com/filer/master/Master_thesis.pdf

-- 
Asbjørn A. Fellinghaug
[EMAIL PROTECTED]
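
To illustrate the stopword-bigram rule described in this message, here is a minimal sketch in plain Java. It is not the thesis code and it does not use the Lucene Analyzer API. The class name StopwordBigrams, the method analyze, and the "_" separator are made up for illustration. How stopwords at the very start or end of the text, or two stopwords in a row, should be handled is not stated in the message, so the behaviour below is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopwordBigrams {

    // Hypothetical helper (not the thesis code): applies the rule
    // "stopword -> (previous + stopword), (stopword + next); drop the stopword".
    public static List<String> analyze(List<String> tokens, Set<String> stopwords) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String tok = tokens.get(i);
            if (!stopwords.contains(tok)) {
                out.add(tok); // ordinary terms pass through unchanged
                continue;
            }
            // Assumption: a bigram is emitted only if the neighbour exists.
            if (i > 0) {
                out.add(tokens.get(i - 1) + "_" + tok); // previous token + stopword
            }
            if (i + 1 < tokens.size()) {
                out.add(tok + "_" + tokens.get(i + 1)); // stopword + next token
            }
            // The stopword itself is never added to the output.
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stopwords = new HashSet<>(Arrays.asList("of", "the"));
        List<String> tokens = Arrays.asList("king", "of", "norway");
        // prints [king, king_of, of_norway, norway]
        System.out.println(analyze(tokens, stopwords));
    }
}
```

In this sketch the term "of" never reaches the output on its own, so an index built from it would hold the much rarer terms "king_of" and "of_norway" instead of one very long posting list for "of". Note that two adjacent stopwords produce the same bigram twice here; a real implementation would have to decide whether to deduplicate.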