Hi Group,

I am indexing and searching a large corpus of news articles. The indexing process is straightforward: I use StandardAnalyzer to analyze the content of each news document.

**************************
Document document = new Document();
// "snum" is stored so it can be retrieved, but it is not indexed
document.add(new Field("snum", snum, Field.Store.YES, Field.Index.NO));
// "content" is analyzed and indexed with term vectors, but not stored
document.add(new Field("content", content, Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.YES));
indexWriter.addDocument(document);
where "snum" is the serial number of the news article and "content" is the actual text of the document.
******************************

So far so good. The searching process is a little more complex, because I am searching for many phrases at once. Let me explain with an example. Suppose I have to retrieve documents that belong to the category "Software Technology" using phrases/query terms related to that topic. I have around 10k phrases that belong to this category (e.g. "data recovery tool", ..., "C++ language", ..., "Steve Jobs", ..., "Mac Layer", ..., "Grid Computing", etc.). My idea was to create a separate PhraseQuery for each of these phrases and then add all of them to a BooleanQuery, like this:

****************************
// setMaxClauseCount is a static method; the default limit is 1024 clauses
BooleanQuery.setMaxClauseCount(10000);
BooleanQuery bQuery = new BooleanQuery();

// allPhrases is the collection of phrase strings for the category
for (String phrase : allPhrases) {
    // split on whitespace and lowercase by hand (the phrase is not run through the analyzer)
    String[] terms = phrase.split("\\s+");
    PhraseQuery pQuery = new PhraseQuery();
    for (String term : terms) {
        pQuery.add(new Term("content", term.toLowerCase()));
    }
    pQuery.setSlop(0); // exact phrase match
    bQuery.add(pQuery, BooleanClause.Occur.SHOULD);
}

int numOfSugg = 2000;
TopDocs matches = isearcher.search(bQuery, numOfSugg);
********************************

Unfortunately, when I search the news content with this approach, the results do not look very promising. Many of the top-ranked documents are not the best candidates for the "Software Technology" topic, even though they do contain some of the phrases (only infrequently).

My questions are:

1) Is there anything wrong with this usage of the phrase/boolean queries?
2) How can I guarantee that the most suitable news documents (i.e. documents that contain many of the related phrases) appear at the top of the search results?
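To make question 2 concrete, the ordering being asked for can be sketched outside Lucene. The following plain-Java sketch is only an illustration: it is not Lucene's scoring and uses no Lucene API. It ranks documents by how many distinct phrases each one contains, using exact adjacent-word matching (the analogue of slop 0). The class and method names (PhraseCoverageRanker, normalize, countMatchedPhrases, rankByCoverage) and the sample articles are hypothetical, and the normalization (lowercasing, treating punctuation as whitespace) is an assumption that only roughly approximates what an analyzer would do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class PhraseCoverageRanker {

    /** Lowercases, turns each run of non-letter/non-digit characters into one space, pads with spaces. */
    static String normalize(String text) {
        String cleaned = text.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}\\p{N}]+", " ").trim();
        return " " + cleaned + " ";
    }

    /** Counts how many of the given phrases occur in the document as exact, adjacent word sequences. */
    static int countMatchedPhrases(String content, List<String> phrases) {
        String doc = normalize(content);
        int matched = 0;
        for (String phrase : phrases) {
            // The padding spaces keep matches aligned to word boundaries.
            if (doc.contains(normalize(phrase))) {
                matched++;
            }
        }
        return matched;
    }

    /** Returns the document ids ordered by number of matched phrases, highest first. */
    static List<String> rankByCoverage(Map<String, String> docs, List<String> phrases) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : docs.entrySet()) {
            counts.put(e.getKey(), countMatchedPhrases(e.getValue(), phrases));
        }
        List<String> ranked = new ArrayList<>(counts.keySet());
        // List.sort is stable, so ties keep their original insertion order.
        ranked.sort(Comparator.comparing((String snum) -> counts.get(snum)).reversed());
        return ranked;
    }

    public static void main(String[] args) {
        List<String> phrases = Arrays.asList(
                "data recovery tool", "C++ language", "Steve Jobs", "Grid Computing");

        // Hypothetical articles keyed by serial number ("snum").
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("a3", "The council debated the new road budget.");
        docs.put("a2", "Steve Jobs visited the city on Tuesday.");
        docs.put("a1", "Steve Jobs unveiled a data recovery tool written in the C++ language.");

        System.out.println(rankByCoverage(docs, phrases)); // prints [a1, a2, a3]
    }
}
```

An ordering like this ignores how often a phrase occurs and how rare it is across the corpus, so it is only a reference ordering to compare the search results against, not a replacement for the query.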
I used BooleanClause.Occur.SHOULD (instead of MUST) because it is impossible to find a single document containing all 10k phrases; with SHOULD, I assume the best results will be the documents that contain at least a few of the phrases.

Thanks in advance,
--d

--
View this message in context: http://lucene.472066.n3.nabble.com/multiple-phrase-search-for-topic-tp3461423p3461423.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.