Re: interesting phrase query issue

greg Thu, 24 Jul 2003 05:45:07 -0700

Steve,

This is great stuff. IMHO this should be the default behaviour - especially coming from someone who is new to Lucene. I can not tell you how surprised i was when someone searched for a phrase and it was returned but did not exist in the document except as "access, the manager". Your patch really did the job.

Thanks again,

Greg T Robertson

Steve Rowe writes:

Greg,

I include a patch below to StopFilter.java which should inhibit exact phrase matching across a removed stopword.

[Would it be useful for this to be the default behavior?]

See the API docs for Token.setPositionIncrement(int):

<URL:http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/analysis/ Token.html#setPositionIncrement(int)> "Set [the position increment] to values greater than one to inhibit exact phrase matches. If, for example, one does not want phrases to match across removed stop words, then one could build a stop word filter that removes stop words and also sets the increment to the number of stop words removed before each non-stop word. Then exact phrase queries will only match when the terms occur with no intervening stop words."

--------->8-----------cut here--------->8----------- Index: StopFilter.java =================================================================== RCS file: /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/analysis/StopFil ter.java,v retrieving revision 1.3 diff -r1.3 StopFilter.java 64a65 > private int positionIncrement = 1; 93,94c94,97 < for (Token token = input.next(); token != null; token = input.next()) < if (table.get(token.termText) == null) --- > for (Token token = input.next(); token != null; token = input.next()) { > if (table.get(token.termText) == null) { > token.setPositionIncrement(positionIncrement); > positionIncrement = 1; // reset the position increment 95a99,102 > } else { > ++positionIncrement; // stopword -- increase the position increment > } > } --------->8-----------cut here--------->8-----------

greg wrote: > I have several document sections that are being indexed via the > StandardAnalyzer. One of these documents has the line "access, the > manager". When searching for the phrase "access manager", this document > is being returned. I understand why (at least i think i do), because a > stop word is "the" and the "," is being removed by the tokenizer, my > question is is there any way I can avoid having this returned in the > results? My thoughts were to create a new analyzer that indexes the > word "the" (blick to many of those), or index the "," in some way (also > not good). Any suggestions? > Thanks, > Greg T Robertson

-- Steve Rowe Software Engineer Center for Natural Language Processing School of Information Studies Syracuse University www.cnlp.org

--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: interesting phrase query issue

Reply via email to