On Thu, May 04, 2006 at 10:52:46AM -0400, Daniel Shane wrote:
> I'm developing a new type of Query, called a SubPhraseQuery. I have sent
> a message to the list regarding this and Doug was kind enough to put me
> on the right track. The query is simply a PhraseQuery where all terms
> are search, but, if any of the subphrases are found, it boosts the
> results the longer the subphrase is.

I can't help on the analyzing portion, but I can show you an alternative
implementation.

We use Lucene to power the search behind isohunt.com, and I came up with a
different way of doing what you want. It's got less in the way of magic
constants, and more in the way of using existing Lucene functionality. It
differs from yours in one respect: the terms are allowed to occur in any
order in the sub-phrases (so phrase "C B" from your original example is
scored like "B C").

If the query is a boolean query, it's a candidate for transmuting.
Otherwise it's just used as is.

/* Pseudo-code follows */
static Query transmuteBooleanQueryToSpanQuery(BooleanQuery bq)
1. Set required = get all terms with BooleanClause.Occur.MUST.
2. Set optional = get all terms with BooleanClause.Occur.SHOULD.
3. If the sum of the sizes of the two sets is <= 1, just return bq
   unchanged (safety case).
4. SpanTermQuery stq[] = (construct a SpanTermQuery for each item in the
   above sets).
5. This is the bit of magic here: define a value 'proximity' using the
   sizes of the sets above. We use required.size*3 + optional.size*2 + 5.
6. snq = new SpanNearQuery(stq, proximity, false);
7. bq.add(snq, BooleanClause.Occur.SHOULD);
8. return bq;

-- 
Robin Hugh Johnson
E-Mail     : [EMAIL PROTECTED]
Home Page  : http://www.orbis-terrarum.net/?l=people.robbat2
ICQ#       : 30269588 or 41961639
GnuPG FP   : 11AC BA4F 4778 E3F6 E4ED F38E B27B 944E 3488 4E85
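
P.S. To make the proximity step and the safety case from the pseudo-code above concrete, here is a minimal sketch in plain Java with no Lucene dependency. The class name SpanProximity and the method names proximity and isCandidate are hypothetical, made up for illustration; only the formula required.size*3 + optional.size*2 + 5 and the "<= 1 terms" check come from the steps above. In a real implementation the computed value would presumably be passed as the slop argument when constructing the SpanNearQuery.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene: it only models the arithmetic
// and the safety check from the pseudo-code.
final class SpanProximity {

    // Proximity (slop) as described in the message:
    // required.size*3 + optional.size*2 + 5.
    static int proximity(int requiredCount, int optionalCount) {
        return requiredCount * 3 + optionalCount * 2 + 5;
    }

    // Safety case: with one term or fewer there is nothing to combine
    // into a span, so the query should be left unchanged.
    static boolean isCandidate(List<String> required, List<String> optional) {
        return required.size() + optional.size() > 1;
    }

    public static void main(String[] args) {
        List<String> required = Arrays.asList("lucene", "span"); // MUST terms
        List<String> optional = Arrays.asList("query");          // SHOULD terms

        if (isCandidate(required, optional)) {
            // 2*3 + 1*2 + 5 = 13
            System.out.println("proximity = "
                    + proximity(required.size(), optional.size()));
        } else {
            System.out.println("query left unchanged");
        }
    }
}
```

The constants 3, 2 and 5 are just the ones we happen to use; the sketch keeps them in one place so they are easy to tune.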