Have a look at the position argument to PhraseQuery.add: it lets you
control where this new term is in the phrase.

So to search for "wizard of oz" when "of" is a stopword, you would add
"wizard" at position 0 and "oz" at position 2.

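As a rough illustration of the position bookkeeping, here is a minimal
sketch in plain Java (no Lucene dependency). PhrasePositions and
withHoles are hypothetical names, not Lucene classes, and the lowercase
plus whitespace split only stands in for a real analyzer. Each
term/position pair it returns is what you would then pass to
PhraseQuery.add:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not a Lucene class: computes the position of each
// non-stopword term in a phrase, leaving holes where stopwords were removed.
class PhrasePositions {

    // A kept term plus its position in the original phrase.
    static final class TermAt {
        final String term;
        final int position;

        TermAt(String term, int position) {
            this.term = term;
            this.position = position;
        }

        @Override
        public String toString() {
            return term + "@" + position;
        }
    }

    // Assumption: lowercasing and splitting on whitespace stands in for the analyzer.
    static List<TermAt> withHoles(String phrase, Set<String> stopwords) {
        List<TermAt> kept = new ArrayList<>();
        String normalized = phrase.trim().toLowerCase(Locale.ROOT);
        if (normalized.isEmpty()) {
            return kept;
        }
        int position = 0;
        for (String word : normalized.split("\\s+")) {
            if (!stopwords.contains(word)) {
                kept.add(new TermAt(word, position));
            }
            position++; // a stopword still consumes a position, which creates the hole
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stopwords = new HashSet<>(Arrays.asList("is", "a", "of", "the"));
        // prints [wizard@0, oz@2]
        System.out.println(withHoles("wizard of oz", stopwords));
        // prints [x@0, david@1, manager@4, company@7]
        System.out.println(withHoles("X David is a manager of the company", stopwords));
    }
}
```
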
This is different from slop, which allows for "fuzzy" matching of the
phrase, e.g. if you pass a slop of 4 (I think), then your search for
"wizard of oz" could match a document containing "oz of wizard".

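To make the difference concrete, here is a simplified model of how much
slop a given ordering needs. It assumes each phrase term occurs exactly
once in the document and that the stopword leaves a hole in the
document's positions as well. SlopModel is a hypothetical name and this
is not Lucene's actual scorer, just the arithmetic as I understand it:
shift each document position back by the term's position in the phrase,
then take the spread of the shifted values.

```java
// Hypothetical, simplified model of phrase slop; not Lucene's actual scorer.
// Assumes every phrase term occurs exactly once in the document.
class SlopModel {

    // queryPositions[i] is the position given to term i in the phrase;
    // docPositions[i] is where that same term occurs in the document.
    static int requiredSlop(int[] queryPositions, int[] docPositions) {
        if (queryPositions.length == 0 || queryPositions.length != docPositions.length) {
            throw new IllegalArgumentException("need one document position per phrase term");
        }
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int i = 0; i < queryPositions.length; i++) {
            // Where the phrase would have to start for this term to line up.
            int shifted = docPositions[i] - queryPositions[i];
            min = Math.min(min, shifted);
            max = Math.max(max, shifted);
        }
        // All terms agree on the start (spread 0) only for an exact match.
        return max - min;
    }

    public static void main(String[] args) {
        int[] phrase = {0, 2}; // "wizard" at 0, "oz" at 2
        System.out.println(requiredSlop(phrase, new int[] {0, 2})); // "wizard of oz": prints 0
        System.out.println(requiredSlop(phrase, new int[] {2, 0})); // "oz of wizard": prints 4
    }
}
```
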
Yes, ShingleFilter bloats the index, but CommonGramsFilter lets you
only pair up a specific subset of tokens, so the bloat is much less.
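
Here is a simplified model of that difference, in plain Java. GramModel
is a hypothetical name, the underscore separator is only for
illustration, and just the extra bigram tokens are modelled (not the
single terms, and not the real filters' implementation):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model, not Lucene code: compares the bigrams produced when
// every adjacent pair is joined with those produced when only pairs that
// involve a common word are joined.
class GramModel {

    // Shingle-style: one bigram for every adjacent pair of tokens.
    static List<String> shingles(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            out.add(tokens.get(i) + "_" + tokens.get(i + 1));
        }
        return out;
    }

    // Common-grams-style: a bigram only when either token is a common word.
    static List<String> commonGrams(List<String> tokens, Set<String> common) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            String left = tokens.get(i);
            String right = tokens.get(i + 1);
            if (common.contains(left) || common.contains(right)) {
                out.add(left + "_" + right);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("senior", "most", "manager", "of", "the", "company");
        Set<String> common = new HashSet<>(Arrays.asList("of", "the"));
        System.out.println(shingles(tokens).size());     // prints 5
        System.out.println(commonGrams(tokens, common)); // prints [manager_of, of_the, the_company]
    }
}
```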

Mike McCandless

http://blog.mikemccandless.com


On Fri, Jul 26, 2013 at 7:34 AM, Ankit Murarka
<ankit.mura...@rancoretech.com> wrote:
> Hello, can you elaborate more on this? I seem to be lost here.
>
> Since I am new to Lucene, yesterday I was going through ShingleFilter and
> its application. It seems to be a kind of n-gram filter, and it bloats the
> index, as Mike has mentioned.
>
> As of now I am only concerned with the appropriate way to solve this problem.
>
> With PhraseQuery, if I specify terms, do you also want me to specify
> slop? If I don't supply slop, it defaults to an exact match. However,
> due to stopwords this PhraseQuery was not giving me any hits, and hence I
> raised this question.
>
> I still don't know where to start with this problem or how to solve it.
>
> I am sure this is supported by Lucene, but perhaps a bit more
> explanation and guidance will do the trick for me.
>
>
> On 7/24/2013 6:06 PM, Michael McCandless wrote:
>>
>> With PhraseQuery you can specify where each term must occur in the phrase.
>>
>> So X must occur in position 0, David in position 1, and then manager
>> in position 4 (skipping 2 holes).
>>
>> QueryParser does this for you: when it analyzes the user's phrase, if
>> the resulting tokens have holes, then it sets the positions
>> accordingly.
>>
>> And I agree: shingles are a good solution here too, but they make your
>> index larger.  CommonGramsFilter lets you shingle only specific words,
>> e.g. you could pass your stop words to it.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Wed, Jul 24, 2013 at 7:34 AM, Ankit Murarka
>> <ankit.mura...@rancoretech.com>  wrote:
>>
>>>
>>> I tried using PhraseQuery with slop. Since I am specifying the slop, I
>>> also need to specify the 2nd term.
>>>
>>> In my case the 2nd term is not present. The whole string to be searched
>>> is still one single term.
>>>
>>> How do I skip the holes created by stopwords? I do not know beforehand
>>> how many stop words are skipped or what string the user is going to enter.
>>>
>>> Is there a definite way to skip the holes created by stopwords?
>>>
>>> I was now looking at MultiPhraseQuery, splitting the user-provided string
>>> on spaces and providing each word as a term to MultiPhraseQuery.
>>>
>>> Will it help? Is there any alternative?
>>>
>>>
>>> On 7/24/2013 4:48 PM, Michael McCandless wrote:
>>>
>>>>
>>>> PhraseQuery?
>>>>
>>>> You can skip the holes created by stopwords, e.g. QueryParser does
>>>> this. I.e., the PhraseQuery becomes "X David _ _ manager _ _ company"
>>>> if is/a/of/the are stop words, which isn't perfect (it could return
>>>> false matches) but should work well in practice.
>>>>
>>>> Mike McCandless
>>>>
>>>> http://blog.mikemccandless.com
>>>>
>>>>
>>>> On Wed, Jul 24, 2013 at 4:31 AM, Ankit Murarka
>>>> <ankit.mura...@rancoretech.com>   wrote:
>>>>
>>>>
>>>>>
>>>>> Dear All,
>>>>>
>>>>> Suppose I have 3 documents. The sample text is:
>>>>>
>>>>> File 1:
>>>>>
>>>>> Mr X David is a manager of the company. He is the senior most manager.
>>>>> I also want to become manager of the company.
>>>>>
>>>>> File 2:
>>>>>
>>>>> Mr X David manager of the company is also very senior. He happens to be
>>>>> the senior most manager. I wish even I could reach that place.
>>>>>
>>>>> File 3:
>>>>>
>>>>> Mr X David is working for a company. He happens to be the manager of
>>>>> the company. In fact he is the senior most manager. I don't want to
>>>>> become like him.
>>>>>
>>>>> String I wish to search: X David is a manager of the company.
>>>>>
>>>>> Ideally I should get only file 1 in the hit result.
>>>>>
>>>>> I have no clue how to achieve this. Basically, I am trying to match
>>>>> part of a sentence or a complete sentence. What would be the best
>>>>> methodology? I presume "is" and "a" are stop words and will be skipped
>>>>> during indexing by the StandardAnalyzer.
>>>>>
>>>>> What I wonder is how I can then search for part of a sentence or a
>>>>> complete sentence if the sentence contains some or many stopwords.
>>>>>
>>>>> I am using StandardAnalyzer. Please guide.
>>>>>
>>>>> --
>>>>> Regards
>>>>>
>>>>> Ankit
>>>>>
>>>>>
>>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Regards
>>>
>>> Ankit Murarka
>>>
>>> "Peace is found not in what surrounds us, but in what we hold within."
>>>
>>>
>>
>>
>
>
>
> --
> Regards
>
> Ankit Murarka
>
> "Peace is found not in what surrounds us, but in what we hold within."
>
>
>
