Hello Uwe,
Thank you for the reply. I see that there is a version check on the use of
setEnablePositionIncrements(false), and I think I may be able to use an
earlier API version with the eXist-db embedding of Lucene 4.4 to avoid that
check.
However, my question was intended to improve my understanding of how to
properly use stop words and/or how to properly achieve the use case that I
outlined.
My naive understanding of the purpose of stop words is:
to remove from indexing words that are not helpful in discriminating or
selecting documents since they occur so frequently.
The use case that I intended to illustrate is one where the presence or
absence of stop words in a query has no effect on which documents are
selected during search.
As I understand the situation currently, occurrences of stop words in a query
phrase are replaced by "?"s to indicate the presence of an otherwise
unspecified word in the query. So the phrase:
blue is the moon
with "is" and "the" as stop words, would be indexed effectively as:
blue ? ? moon
and the query phrase:
blue was a moon
would be treated as:
blue ? ? moon
and would retrieve a document containing:
blue is the moon
But in the use case that I presented we really want the query:
blue moon
to also select the document, without the user having to indicate where stop
words might or might not be present.
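
To make my current understanding concrete, here is a minimal sketch in plain
Java. It does not use the Lucene API at all: StopWordGaps, analyze,
phraseMatches, and the keepGaps flag are hypothetical names of my own, and the
flag only imitates what I take position increments to do (true leaves a gap for
each removed stop word, false closes the gaps). The stop word list is likewise
assumed just for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java simulation of stop word gaps; NOT Lucene code.
final class StopWordGaps {
    // Assumed stop word list for the examples in this message.
    static final Set<String> STOP =
        new HashSet<>(Arrays.asList("is", "the", "was", "a"));

    // A kept (non-stop) token and the position assigned to it.
    static final class Tok {
        final String term;
        final int pos;
        Tok(String term, int pos) { this.term = term; this.pos = pos; }
    }

    // Split on whitespace, drop stop words, and assign positions.
    // keepGaps == true: each removed stop word still consumes a position.
    // keepGaps == false: kept tokens are numbered consecutively.
    static List<Tok> analyze(String text, boolean keepGaps) {
        List<Tok> out = new ArrayList<>();
        int pos = -1;
        int skipped = 0;
        for (String w : text.toLowerCase().trim().split("\\s+")) {
            if (STOP.contains(w)) { skipped++; continue; }
            pos += 1 + (keepGaps ? skipped : 0);
            skipped = 0;
            out.add(new Tok(w, pos));
        }
        return out;
    }

    // Exact phrase match: every query token must occur in the document
    // at the same relative position as in the query.
    static boolean phraseMatches(String doc, String query, boolean keepGaps) {
        List<Tok> d = analyze(doc, keepGaps);
        List<Tok> q = analyze(query, keepGaps);
        if (q.isEmpty()) return false;
        for (Tok start : d) {
            if (!start.term.equals(q.get(0).term)) continue;
            int offset = start.pos - q.get(0).pos;
            boolean all = true;
            for (Tok qt : q) {
                boolean found = false;
                for (Tok dt : d) {
                    if (dt.term.equals(qt.term) && dt.pos == qt.pos + offset) {
                        found = true;
                        break;
                    }
                }
                if (!found) { all = false; break; }
            }
            if (all) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        String doc = "blue is the moon";
        System.out.println(phraseMatches(doc, "blue was a moon", true));  // true
        System.out.println(phraseMatches(doc, "blue moon", true));        // false
        System.out.println(phraseMatches(doc, "blue moon", false));       // true
    }
}
```

With gaps kept, "blue moon" misses the document because "moon" sits three
positions after "blue" in the index but only one position after it in the
query. With gaps closed, both sides collapse to adjacent positions and the
phrase matches, which is the behavior I am after.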
So my question is:
How is one supposed to achieve this use case in Lucene 4.4+?
Thank you,
Chris
On Apr 24, 2014, at 5:52 AM, Uwe Schindler <[email protected]> wrote:
> Hi,
>
> You can still change the setting on the TokenFilter after creating it:
> StopFilter#setEnablePositionIncrements(false) - this method was *not* removed!
> This fails only if you pass matchVersion>=Version.LUCENE_44. Just use an
> older matchVersion parameter to the constructor and you can still enable this
> broken behavior (for backwards compatibility).
>
> This is no longer officially supported, but can be a workaround. To me it
> looks like you misunderstood stopwords.
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: [email protected]
>
>
>> -----Original Message-----
>> From: Tincu Gabriel [mailto:[email protected]]
>> Sent: Thursday, April 24, 2014 12:27 PM
>> To: [email protected]
>> Subject: Re: What is the proper use of stop words in Lucene?
>>
>> Hi there,
>> The StopFilterFactory can be used to produce StopFilters with the desired
>> stop words. As a constructor argument it takes a
>> Map<String,String>, and one of the valid keys you can pass in it is
>> "enablePositionIncrements". If you don't pass that key, it defaults to
>> true.
>> Is this what you were looking for?
>>
>>
>> On Wed, Apr 23, 2014 at 12:36 PM, Chris Tomlinson <
>> [email protected]> wrote:
>>
>>> Hello,
>>>
>>> I've written several times now on the list with this question /
>>> problem and no one has yet replied so I don't know if the question is
>>> too wrong-headed or if there is simply no one reading the list that
>>> can comment on the question.
>>>
>>> The question that I'm trying to get answered is what is the correct
>>> way of ignoring stop word gaps in Lucene 4.4+?
>>>
>>> While we are using Lucene 4.4 embedded in eXist-db (exist-db.org), I
>>> think the question is a proper Lucene question and really has nothing
>>> to do with the fact that we're using it in an embedded manner.
>>>
>>> The problem to be solved is how to ignore stop word gaps in queries -
>>> without the user having to indicate where such gaps might occur at
>>> query time.
>>>
>>> Since Lucene 4.4 the
>>> FilteringTokenFilter.setEnablePositionIncrements(false) is not available.
>>> None of the resources I have found, such as "Lucene in Action", explain
>>> how to use Lucene to get the desired effect now that 4.4+ has removed
>>> the previous approach.
>>>
>>> Prior to Lucene 4.4 it was possible to
>>> setEnablePositionIncrements(false)
>>> so that during indexing and querying the number and position of stop
>>> word gaps would be ignored (as mentioned on pp 138-139 of "Lucene in
>>> Action").
>>>
>>> This meant that a document with a phrase such as:
>>>
>>> blue is the sky
>>>
>>> with stop words "is" and "the" would be selected by the query:
>>>
>>> blue sky
>>>
>>> This is what we want to achieve.
>>>
>>> Why? We are working with Tibetan and elisions are not uncommon so
>>> that, e.g.:
>>>
>>> rin po che
>>>
>>> on some occasions might be shortened to
>>>
>>> rin che
>>>
>>> and we would like to have a query of
>>>
>>> rin po che
>>>
>>> or
>>>
>>> rin che
>>>
>>> find all occurrences of
>>>
>>> rin po che
>>>
>>> and
>>>
>>> rin che
>>>
>>> without having the user have to mark where elisions might occur.
>>>
>>> The
>>> org.apache.lucene.queryparser.flexible.standard.CommonQueryParserConfiguration
>>> provides a setEnablePositionIncrements but that does not seem
>>> to work to allow for the above desired query behavior that was
>>> possible prior to Lucene 4.4.
>>>
>>> What is the proper way to ignore stop word gaps?
>>>
>>> Thank you,
>>> Chris
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>>
>>>