This is great stuff. IMHO this should be the default behaviour, especially from the point of view of someone who is new to Lucene. I cannot tell you how surprised I was when someone searched for a phrase and a document was returned that did not contain it, except as "access, the manager". Your patch really did the job.
Thanks again,
Greg T Robertson
Steve Rowe writes:
Greg,
I include a patch below to StopFilter.java which should inhibit exact phrase
matching across a removed stopword.
[Would it be useful for this to be the default behavior?]
See the API docs for Token.setPositionIncrement(int):
<URL:http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/analysis/Token.html#setPositionIncrement(int)>
"Set [the position increment] to values greater than one to inhibit exact phrase
matches. If, for example, one does not want phrases to match across removed stop
words, then one could build a stop word filter that removes stop words and also
sets the increment to the number of stop words removed before each non-stop
word. Then exact phrase queries will only match when the terms occur with no
intervening stop words."
--------->8-----------cut here--------->8-----------
Index: StopFilter.java
===================================================================
RCS file: /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/analysis/StopFilter.java,v
retrieving revision 1.3
diff -r1.3 StopFilter.java
64a65
> private int positionIncrement = 1;
93,94c94,97
< for (Token token = input.next(); token != null; token = input.next())
< if (table.get(token.termText) == null)
---
> for (Token token = input.next(); token != null; token = input.next()) {
> if (table.get(token.termText) == null) {
> token.setPositionIncrement(positionIncrement);
> positionIncrement = 1; // reset the position increment
95a99,102
> } else {
> ++positionIncrement; // stopword -- increase the position increment
> }
> }
--------->8-----------cut here--------->8-----------
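To illustrate what the patch changes, here is a minimal standalone sketch that does not use Lucene at all. The class StopGapSketch and its methods filter, matchesPhrase and hasTermAt are hypothetical names made up for this illustration, and the phrase matcher is a simplified stand-in for an exact phrase query. It only mirrors the counting logic in the patch: each surviving term advances by one plus the number of stop words removed before it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; none of these names are Lucene API.
public class StopGapSketch {

    /** A term together with its absolute position in the token stream. */
    static final class PositionedTerm {
        final String term;
        final int position;

        PositionedTerm(String term, int position) {
            this.term = term;
            this.position = position;
        }
    }

    /**
     * Removes stop words. With keepGaps set, a surviving term advances by
     * 1 + (number of stop words removed before it), as in the patch.
     * Without it, every surviving term advances by exactly 1.
     */
    static List<PositionedTerm> filter(List<String> terms, Set<String> stopWords, boolean keepGaps) {
        List<PositionedTerm> out = new ArrayList<>();
        int position = -1;
        int positionIncrement = 1;
        for (String term : terms) {
            if (!stopWords.contains(term)) {
                position += keepGaps ? positionIncrement : 1;
                out.add(new PositionedTerm(term, position));
                positionIncrement = 1; // reset the position increment
            } else {
                ++positionIncrement;   // stop word: widen the gap
            }
        }
        return out;
    }

    /** True if the given term occurs at exactly the given position. */
    static boolean hasTermAt(List<PositionedTerm> doc, String term, int position) {
        for (PositionedTerm t : doc) {
            if (t.position == position && t.term.equals(term)) {
                return true;
            }
        }
        return false;
    }

    /** Simplified exact phrase match: the phrase terms must sit at consecutive positions. */
    static boolean matchesPhrase(List<PositionedTerm> doc, List<String> phrase) {
        for (PositionedTerm start : doc) {
            boolean all = true;
            for (int i = 0; i < phrase.size() && all; i++) {
                all = hasTermAt(doc, phrase.get(i), start.position + i);
            }
            if (all) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("the"));
        List<String> doc = Arrays.asList("access", "the", "manager");
        List<String> phrase = Arrays.asList("access", "manager");

        // Without gaps, "access" and "manager" end up adjacent.
        System.out.println(matchesPhrase(filter(doc, stop, false), phrase)); // prints true
        // With gaps, "manager" lands two positions after "access".
        System.out.println(matchesPhrase(filter(doc, stop, true), phrase));  // prints false
    }
}
```

With the gap preserved, "manager" sits at position 2 rather than 1, so the phrase "access manager" no longer matches "access, the manager".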
greg wrote:
> I have several document sections that are being indexed via the
> StandardAnalyzer. One of these documents has the line "access, the
> manager". When searching for the phrase "access manager", this document
> is being returned. I understand why (at least I think I do): "the" is
> a stop word and the "," is being removed by the tokenizer. My question
> is: is there any way I can avoid having this returned in the results?
> My thoughts were to create a new analyzer that indexes the word "the"
> (blick, too many of those), or index the "," in some way (also not
> good). Any suggestions?
> Thanks,
> Greg T Robertson
--
Steve Rowe
Software Engineer
Center for Natural Language Processing
School of Information Studies
Syracuse University
www.cnlp.org
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
---------------------------------------------------------------------
