If, however, you want "phone the boy" to match "phone X boy", where X is any word, then PhraseQuery would have to be extended. It's actually a pretty simple extension. Each term in a PhraseQuery corresponds to a PhrasePositions object. The 'offset' field within it is the position of the term in the phrase. If you construct the phrase positions for a two-term phrase so that the first has offset=0 and the second has offset=2, then you'll get this sort of matching. So all that's needed is a new method, PhraseQuery.add(Term term, int offset), and for these offsets to be stored so that they can be used when building PhrasePositions. Would this be a useful feature?
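To make the offset idea concrete, here is a rough standalone sketch in plain Java. It does not use Lucene at all: the class PhraseOffsetSketch and its index and matches helpers are made up for illustration and only model the matching rule described above, where every term must occur at the phrase's start position plus that term's offset.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of offset-based phrase matching; not Lucene code.
class PhraseOffsetSketch {

    // Maps each term to the list of positions at which it occurs.
    static Map<String, List<Integer>> index(String... tokens) {
        Map<String, List<Integer>> positions = new HashMap<>();
        for (int i = 0; i < tokens.length; i++) {
            positions.computeIfAbsent(tokens[i], k -> new ArrayList<>()).add(i);
        }
        return positions;
    }

    // True if some start position p exists such that terms[i] occurs
    // at p + offsets[i] for every i.
    static boolean matches(Map<String, List<Integer>> positions,
                           String[] terms, int[] offsets) {
        List<Integer> first =
            positions.getOrDefault(terms[0], Collections.emptyList());
        for (int p : first) {
            int start = p - offsets[0];
            boolean all = true;
            for (int i = 1; i < terms.length; i++) {
                List<Integer> ps =
                    positions.getOrDefault(terms[i], Collections.emptyList());
                if (!ps.contains(start + offsets[i])) {
                    all = false;
                    break;
                }
            }
            if (all) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> doc = index("phone", "the", "boy");
        String[] phrase = {"phone", "boy"};
        // Offsets 0 and 2 leave room for exactly one word in between.
        System.out.println(matches(doc, phrase, new int[] {0, 2})); // true
        // Offsets 0 and 1 require the terms to be adjacent.
        System.out.println(matches(doc, phrase, new int[] {0, 1})); // false
    }
}
```

With offsets 0 and 2 the phrase matches "phone the boy" but would not match a document containing "phone boy" with the two terms adjacent.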
My questions were really academic: I wanted to understand position increments and how they relate to searching. I definitely agree (and who could argue?) with Nutch and Google! Removing stop words is not a good thing, but smart handling of pervasive terms is important, as you have implemented in Nutch when not doing phrase queries and with the bi-gram stuff.
It does seem handy, though, to avoid exact phrase matches on "phone boy" when a stop word has been removed, so patching StopFilter to put in the missing positions seems reasonable to me for now. Any objections to that?
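Roughly what I have in mind, as a standalone toy rather than a patch: the StopPositionSketch class, its Token holder, and the stop word list below are all invented for illustration and are not the real StopFilter. The point is only that a removed stop word can still consume a position, so the surviving terms keep their original spacing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of a stop filter that preserves position gaps.
class StopPositionSketch {

    // Illustrative stop word list, not Lucene's default set.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("the", "a", "an"));

    // Minimal token holder: a term and its position in the stream.
    static class Token {
        final String term;
        final int position;

        Token(String term, int position) {
            this.term = term;
            this.position = position;
        }
    }

    // Drops stop words. When keepGaps is true, each removed word still
    // advances the position counter, leaving a hole in the positions.
    static List<Token> filter(String[] tokens, boolean keepGaps) {
        List<Token> out = new ArrayList<>();
        int position = -1;
        int pendingGap = 0;
        for (String t : tokens) {
            if (STOP_WORDS.contains(t)) {
                if (keepGaps) {
                    pendingGap++;
                }
                continue;
            }
            position += 1 + pendingGap;
            pendingGap = 0;
            out.add(new Token(t, position));
        }
        return out;
    }

    public static void main(String[] args) {
        String[] text = {"phone", "the", "boy"};
        for (Token t : filter(text, true)) {
            System.out.println(t.term + " @ " + t.position); // phone @ 0, boy @ 2
        }
        for (Token t : filter(text, false)) {
            System.out.println(t.term + " @ " + t.position); // phone @ 0, boy @ 1
        }
    }
}
```

With the gap kept, "boy" sits at position 2, so an exact phrase query for "phone boy" (adjacent positions) would no longer match text that originally read "phone the boy".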
Erik
