Re: matching sub phrases in user entered query...

Karl Wettin Tue, 15 Jul 2008 00:48:30 -0700

Couldn't you create multiple "shingle phrase queries" from the user query and add them all to a BooleanQuery?


"example input query"^10 OR
"example input"^5 OR
"input query"^5

SpanNearQuery and PhraseQuery are rather expensive, though. Not too long ago I replaced phrase queries with shingles in an index containing tens of millions of small documents. Queries then took something like 1/10 of the time to match, with results that were just as good, if not better in some cases.
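
As a rough illustration of the shingle phrase queries suggested above, here is a minimal Python sketch that builds the query string from the user query. It only produces text in Lucene query-string form and does not call Lucene. The function names and the boost rule (5 per extra term) are assumptions, chosen only so that the output reproduces the ^10 / ^5 example.

```python
def shingles(terms, n):
    """Return every run of n consecutive terms (the n-word shingles)."""
    return [terms[i:i + n] for i in range(len(terms) - n + 1)]


def shingle_phrase_query(user_query, min_n=2, boost_step=5):
    """Build an OR of quoted sub phrases, longest first.

    The boost boost_step * (n - 1) is an assumption, picked only to
    match the ^10 / ^5 values in the example; tune it for real data.
    """
    terms = user_query.split()
    clauses = []
    for n in range(len(terms), min_n - 1, -1):
        for shingle in shingles(terms, n):
            boost = boost_step * (n - 1)
            clauses.append('"%s"^%d' % (" ".join(shingle), boost))
    return " OR ".join(clauses)


print(shingle_phrase_query("example input query"))
```

In a real application each clause would presumably become a query object added to a BooleanQuery as an optional clause; that wiring is left out here.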


    karl


On 15 Jul 2008, at 05:35, Preetam Rao wrote:

Hi Steve,
It would be simpler if I had a query called SubPhraseQuery, in which case
I would not have to either generate extra terms during ingestion or
generate extra queries during querying. As a user, the best I would hope
for is to ingest the data from some feed into different fields, run the
user query as is using some set of Lucene queries, and get the most
relevant results without worrying much about the internals of scoring.
In my case, I know that each field will most likely match some sub phrase
of the user query, and I need a query or Solr request handler which
handles this case.
For cases where I care more about exact matches or sub phrase matches and
not about tf or idf, I think a SubPhraseQuery with the following
parameters would be great:
phraseBoostFactor - factor which tells how much better an n-term match is
than an (n-1)-term match.
useHighestMatch - picks only the best sub phrase match as the score.
ignoreDuplicates - ignores duplicate sub phrase matches.
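
One possible reading of the three proposed parameters, sketched in plain Python rather than as a Lucene Query. No SubPhraseQuery class exists in Lucene; the scoring rule here (an n-term match scores phrase_boost_factor ** (n - 1)) and all function names are assumptions made for illustration.

```python
def sub_phrase_matches(query_terms, field_terms):
    """Return each query sub phrase found in the field, as a tuple,
    once per occurrence in the field."""
    matches = []
    for n in range(1, len(query_terms) + 1):
        wanted = {tuple(query_terms[i:i + n])
                  for i in range(len(query_terms) - n + 1)}
        for j in range(len(field_terms) - n + 1):
            window = tuple(field_terms[j:j + n])
            if window in wanted:
                matches.append(window)
    return matches


def sub_phrase_score(query, field_text, phrase_boost_factor=2.0,
                     use_highest_match=False, ignore_duplicates=False):
    """Score a field by its sub phrase matches only (no tf or idf)."""
    matches = sub_phrase_matches(query.lower().split(),
                                 field_text.lower().split())
    if ignore_duplicates:
        # Count each distinct sub phrase once, however often it occurs.
        matches = list(set(matches))
    # Assumed rule: each extra matched term multiplies the score.
    scores = [phrase_boost_factor ** (len(m) - 1) for m in matches]
    if not scores:
        return 0.0
    return max(scores) if use_highest_match else sum(scores)


print(sub_phrase_score("red leather sofa", "big red leather chair red"))
```

Note that in this sketch the shorter matches contained in a longer match ("red" and "leather" inside "red leather") are counted as well unless use_highest_match is set; whether that is desirable is a design choice the proposal leaves open.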

Currently I have tried using Solr's dismax handler as well as the standard
request handler with boosts and other parameters, but out of 3 million
documents I am unable to get the most relevant top 5 results, which is
most important for me. Trying to understand the scoring and fine-tuning
it was no help either. To get the most relevant top 10 results, I am
willing to resort to some kind of exact-match-based scores rather than
rely on Lucene's scoring formulas.
If a query like the above were in place, then one could use another Solr
request handler, similar to dismax, which uses SubPhraseQuery instead of
dismax queries. And for a user who is more interested in exact/sub phrase
matches in some fields and normal boolean matches in others, these
additional parameters at the handler level would be useful:
matchOnlyOneField - matches only one field and does not use matched terms
on another field.
ignoretfIdf - on a per-field basis, allows one to ignore some scoring
calculations and just use the sub phrase scores.
This handler, combined with the sub phrase query parameters, could prove
very useful in handling user queries as they are, with a lot more
flexibility.
I believe the same use case might exist for many users who have mostly
structured data with only some portions of it being free text, like
descriptions, reviews etc., and who want to handle user-entered queries as
is without resorting to query interpretation or score tuning.
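
A hedged sketch of how the two handler-level parameters (matchOnlyOneField and ignoretfIdf) might combine per-field scores. This is only one interpretation: taking the single best field for matchOnlyOneField is an assumption, and the 'full' value is a stand-in supplied by the caller for whatever tf-idf based score the field would normally get. Nothing here is an existing Solr handler.

```python
def combine_field_scores(fields, match_only_one_field=False):
    """Combine per-field scores into one document score.

    fields maps a field name to a dict with the keys:
      'sub_phrase'    - sub phrase score for the field
      'full'          - stand-in for the regular tf-idf based score
      'ignore_tf_idf' - if True, use only the sub phrase score
    """
    per_field = [
        f["sub_phrase"] if f["ignore_tf_idf"] else f["full"]
        for f in fields.values()
    ]
    if not per_field:
        return 0.0
    # Assumed reading of matchOnlyOneField: keep the best field only.
    return max(per_field) if match_only_one_field else sum(per_field)


doc = {
    "title": {"sub_phrase": 4.0, "full": 1.5, "ignore_tf_idf": True},
    "description": {"sub_phrase": 2.0, "full": 0.5, "ignore_tf_idf": False},
}
print(combine_field_scores(doc))
print(combine_field_scores(doc, match_only_one_field=True))
```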

Thanks
Preetam
On Mon, Jul 14, 2008 at 11:33 PM, Steven A Rowe <[EMAIL PROTECTED]> wrote:
Hi Preetam,

On 07/14/2008 at 1:40 PM, Preetam Rao wrote:
> Is there a query in Lucene which matches sub phrases?
> [snip]
> I was redirected to Shingle filter which is a token filter
> that spits out n-grams. But it does not seem to be best solution
> since one does not know in advance what n in n-grams should be.
You could guess at the useful range, though, and then have
((max n)-(min n)+1) fields, scaling the boost for each with the
corresponding value of n. Just using 2-grams could be good enough, since
the longer the sub-phrase match, the more matching 2-grams.
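
To make the two suggestions above concrete, here is a small Python sketch. The first function writes one clause per shingle against a separate field for each n, boosted by n; the field names (shingle_2, shingle_3) and the query-string form are hypothetical and only illustrate the shape of such a query. The second counts shared 2-grams, showing that a k-term sub phrase match contributes k-1 matching 2-grams.

```python
def shingles(terms, n):
    """Return the n-word shingles of a term list as strings."""
    return [" ".join(terms[i:i + n]) for i in range(len(terms) - n + 1)]


def per_n_field_query(user_query, min_n=2, max_n=3, field_prefix="shingle_"):
    """One clause per shingle, searched in the field for its n and
    boosted by n. Uses (max_n - min_n + 1) hypothetical fields."""
    terms = user_query.split()
    clauses = []
    for n in range(min_n, max_n + 1):
        for s in shingles(terms, n):
            clauses.append('%s%d:"%s"^%d' % (field_prefix, n, s, n))
    return " OR ".join(clauses)


def matching_bigrams(user_query, field_text):
    """Count distinct query 2-grams that also occur in the field text."""
    query_bigrams = set(shingles(user_query.split(), 2))
    field_bigrams = set(shingles(field_text.split(), 2))
    return len(query_bigrams & field_bigrams)


print(per_n_field_query("example input query"))
print(matching_bigrams("a b c d", "x a b y"))    # shares "a b"
print(matching_bigrams("a b c d", "x a b c y"))  # shares "a b", "b c"
```

With 2-grams only, a longer sub phrase match simply lights up more clauses, so it is rewarded without a separate field per n.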
> Also it means one has to get all these bigrams and then construct
> a boolean OR query which is not very efficient either.
In terms of your requirements, though, I think you're stuck with this
inefficiency, no matter what solution you end up with; you need to do some
form of term combination in your queries. And the ShingleFilter approach
doesn't compare badly here, since positions for phrase queries don't have
to be looked up during scoring.

If index space efficiency is a concern, though, the
one-field-per-value-of-n solution I mentioned above could pose a problem.
Steve

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


