Re: Search a Part of the Sentence/Complete sentence in lucene 4.3

Ankit Murarka Sat, 27 Jul 2013 00:21:46 -0700

Ok. I went through the Javadoc of PhraseQuery and tried using the position argument to PhraseQuery.add.


Problem encountered:

My text contains: Still it is not happening and generally i will be able to complete it at the earliest.

The user enters search string: 1. still happening and 2. still it is not happening.

Now, based on what I understood, for the first input I will add still at 0 and happening at 1 of the PhraseQuery position. This will not give me any hit.

For the second input, do I still need to add still at 0 and happening at 4 to the PhraseQuery position? This would mean I need to store the stopwords locally and parse every user input to strip the stopwords and extract the required terms. That might not be a feasible solution. Parsing the user input to discard stopwords and then searching does give me a hit, but implementing it by parsing user input is not at all recommended.



On 7/26/2013 6:49 PM, Michael McCandless wrote:

Have a look at the position argument to PhraseQuery.add: it lets you
control where this new term is in the phrase.

So to search for "wizard of oz" when of is a stopword you would add
"wizard" at position 0 and "oz" at position 2.
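To make the position bookkeeping concrete, here is a minimal sketch in plain Java. It does not call Lucene; it only computes the (term, position) pairs that would then be passed to PhraseQuery.add. The class name PhrasePositions, the helper positions(), and the small stop word set are assumptions for illustration and not Lucene's default list. The point is that the same stop word set used at index time can be applied to the user's input, so the holes fall out automatically.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class PhrasePositions {
    // Hypothetical stop word set for illustration; not Lucene's default list.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("of", "it", "is", "not"));

    /** A surviving term together with the position it had in the original phrase. */
    static final class PositionedTerm {
        final String text;
        final int position;

        PositionedTerm(String text, int position) {
            this.text = text;
            this.position = position;
        }

        @Override
        public String toString() {
            return text + "@" + position;
        }
    }

    /** Drops stop words but keeps counting positions, so each stop word leaves a hole. */
    static List<PositionedTerm> positions(String phrase) {
        List<PositionedTerm> result = new ArrayList<>();
        int position = 0;
        for (String word : phrase.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // empty input produces one empty token; ignore it
            }
            if (!STOP_WORDS.contains(word)) {
                // In Lucene, this pair is what PhraseQuery.add(Term, int position) would receive.
                result.add(new PositionedTerm(word, position));
            }
            position++;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(positions("wizard of oz"));              // [wizard@0, oz@2]
        System.out.println(positions("still it is not happening")); // [still@0, happening@4]
        System.out.println(positions("still happening"));           // [still@0, happening@1]
    }
}
```

Under these assumptions, "still happening" yields adjacent positions and so should not match text where three stop words sit between the two terms, while "still it is not happening" yields the 0 and 4 positions that line up with the indexed text.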

This is different from slop, which allows for "fuzzy" matching of the
phrase, e.g. if you pass slop of 4 (I think) then your search for
"wizard of oz" could match a document containing "oz of wizard".

Yes, ShingleFilter bloats the index, but CommonGramsFilter lets you
only pair up a specific subset of tokens, so the bloat is much less.
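The difference between shingling everything and pairing only common words can be sketched in plain Java. This is a simplified simulation, not the CommonGramsFilter implementation: the class name CommonGramsSketch, the method commonGrams(), and the "_" separator are assumptions for illustration, and the real filter also deals with position increments and token attributes that are ignored here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class CommonGramsSketch {
    /**
     * Emits every token unchanged and adds a joined bigram only when
     * at least one token of an adjacent pair is a common word.
     */
    static List<String> commonGrams(List<String> tokens, Set<String> commonWords) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String current = tokens.get(i);
            out.add(current);
            if (i + 1 < tokens.size()) {
                String next = tokens.get(i + 1);
                if (commonWords.contains(current) || commonWords.contains(next)) {
                    out.add(current + "_" + next); // "_" separator is an assumption
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> common = new HashSet<>(Arrays.asList("of"));
        // A pair is formed on each side of the common word.
        System.out.println(commonGrams(Arrays.asList("wizard", "of", "oz"), common));
        // No common word in the input, so no extra tokens are added.
        System.out.println(commonGrams(Arrays.asList("senior", "most", "manager"), common));
    }
}
```

A full shingle approach would add a bigram for every adjacent pair, which is where the index bloat comes from; here pairs without a common word add nothing.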

Mike McCandless

http://blog.mikemccandless.com


On Fri, Jul 26, 2013 at 7:34 AM, Ankit Murarka
<[email protected]>  wrote:

Hello, can you elaborate more on this? I seem to be lost here.

Since I am new to Lucene, yesterday I was going through ShingleFilter and
its application. It seems to be a kind of N-gram thing, and it bloats the
index, as Mike has mentioned.

As of now I am only concerned with the appropriate way to solve this problem.

With PhraseQuery, if I specify terms, do you also want me to specify
slop? If I don't supply slop it defaults to an exact match. However,
due to stopwords this PhraseQuery was not giving me any hits, and hence I
raised this question.

I still don't know where to approach this problem from and how to solve it.

I am sure this is definitely supported by Lucene, but perhaps a bit more
explanation and guidance will do the trick for me.


On 7/24/2013 6:06 PM, Michael McCandless wrote:

With PhraseQuery you can specify where each term must occur in the phrase.

So X must occur in position 0, David in position 1, and then manager
in position 4 (skipping 2 holes).

QueryParser does this for you: when it analyzes the user's phrase, if
the resulting tokens have holes, then it sets the positions
accordingly.

And I agree: shingles are a good solution here too, but they make your
index larger.  CommonGramsFilter lets you shingle only specific words,
e.g. you could pass your stop words to it.

Mike McCandless

http://blog.mikemccandless.com


On Wed, Jul 24, 2013 at 7:34 AM, Ankit Murarka
<[email protected]>   wrote:

I tried using PhraseQuery with slop. Now, since I am specifying the slop, I
also need to specify the 2nd term.

In my case the 2nd term is not present. The whole string to be searched is
still 1 single term.

How do I skip the holes created by stopwords? I do not know beforehand how
many stop words are skipped and what string the user is going to enter.

Is there a definite way to skip the holes created by stopwords?

I was now looking at MultiPhraseQuery, splitting the user-provided string on
spaces and providing each word as a term to MultiPhraseQuery.

Will it help? Is there any alternative?


On 7/24/2013 4:48 PM, Michael McCandless wrote:

PhraseQuery?

You can skip the holes created by stopwords ... e.g. QueryParser does
this.  I.e., the PhraseQuery becomes "X David _ _ manager _ _ company"
if is/a/of/the are stop words, which isn't perfect (could return false
matches) but should work well in practice ...
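A small plain-Java simulation shows both the intended hit and the kind of false match mentioned above. It is a sketch of positional phrase matching with holes, not Lucene's matching code; the class name HolePhraseMatch and its methods are assumptions for illustration, and the tokenizer simply lowercases and splits on non-letters.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class HolePhraseMatch {
    // Stop words taken from the example above: is/a/of/the.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("is", "a", "of", "the"));

    /** Maps each surviving term to its positions; stop words leave gaps in the numbering. */
    static Map<String, Set<Integer>> index(String text) {
        Map<String, Set<Integer>> postings = new HashMap<>();
        int position = 0;
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (word.isEmpty()) {
                continue;
            }
            if (!STOP_WORDS.contains(word)) {
                postings.computeIfAbsent(word, k -> new HashSet<>()).add(position);
            }
            position++;
        }
        return postings;
    }

    /** True if every query term occurs in the document at the same relative position. */
    static boolean matches(String document, String phrase) {
        Map<String, Set<Integer>> doc = index(document);
        Map<String, Set<Integer>> query = index(phrase);
        if (query.isEmpty()) {
            return false;
        }
        // Anchor on one query term and try each of its occurrences in the document.
        Map.Entry<String, Set<Integer>> anchor = query.entrySet().iterator().next();
        int anchorQueryPos = anchor.getValue().iterator().next();
        Set<Integer> none = Collections.emptySet();
        for (int docPos : doc.getOrDefault(anchor.getKey(), none)) {
            if (aligned(doc, query, docPos - anchorQueryPos)) {
                return true;
            }
        }
        return false;
    }

    private static boolean aligned(Map<String, Set<Integer>> doc,
                                   Map<String, Set<Integer>> query, int offset) {
        for (Map.Entry<String, Set<Integer>> e : query.entrySet()) {
            Set<Integer> docPositions = doc.get(e.getKey());
            if (docPositions == null) {
                return false;
            }
            for (int queryPos : e.getValue()) {
                if (!docPositions.contains(queryPos + offset)) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String phrase = "X David is a manager of the company";
        System.out.println(matches("Mr X David is a manager of the company.", phrase));
        System.out.println(matches("Mr X David manager of the company is also very senior.", phrase));
        // Different stop words fill the holes, yet the positions still line up.
        System.out.println(matches("X David is the manager of a company", phrase));
    }
}
```

In this sketch the first and third calls return true and the second returns false. The third is the false match: the holes only record that something was skipped, not which stop words were there.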

Mike McCandless

http://blog.mikemccandless.com


On Wed, Jul 24, 2013 at 4:31 AM, Ankit Murarka
<[email protected]>    wrote:

Dear All,

Suppose I have 3 documents. The sample text is:

File 1:

Mr X David is a manager of the company. He is the senior most manager. I
also want to become manager of the company.

File 2:

Mr X David manager of the company is also very senior. He happens to be the
senior most manager. I wish even I could reach that place.

File 3:

Mr X David is working for a company. He happens to be the manager of the
company. In fact he is the senior most manager. I don't want to become like
him.

String I wish to search: X David is a manager of the company.

Ideally I should get only File 1 in the hit result.

I have no clue how to achieve this. Basically I am trying to match a part
of a sentence or a complete sentence. What would be the best methodology?
I presume "is" and "a" are stop words and will be skipped during indexing by
the StandardAnalyzer.

What I wonder is how I can then search for a part of a sentence or a
complete sentence if the sentence contains some/many stopwords.

I am using StandardAnalyzer. Please guide.

--
Regards

Ankit

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



--
Regards

Ankit Murarka

"Peace is found not in what surrounds us, but in what we hold within."

