That is very good performance.

But if I take, on average, 6 terms per user query and look at shingles
of size 2, I will have a boolean OR of 5 shingle phrase queries. How much
better is this than a single sub-phrase query, which internally would be
just another phrase query with some extra operations? Also, each shingle
phrase query is going to be treated like a phrase query, right?
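
For concreteness, here is roughly what that OR would look like (an
untested sketch; the field name and terms are made up):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.PhraseQuery;

    // Untested sketch: 6 query terms -> 5 overlapping size-2 shingles,
    // each issued as a two-term PhraseQuery, all OR'ed together.
    String[] terms = {"t1", "t2", "t3", "t4", "t5", "t6"};
    BooleanQuery or = new BooleanQuery();
    for (int i = 0; i < terms.length - 1; i++) {
        PhraseQuery shingle = new PhraseQuery();
        shingle.add(new Term("body", terms[i]));
        shingle.add(new Term("body", terms[i + 1]));
        or.add(shingle, BooleanClause.Occur.SHOULD);
    }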

Thanks
Preetam

On Tue, Jul 15, 2008 at 1:17 PM, Karl Wettin <[EMAIL PROTECTED]> wrote:

> Couldn't you create multiple "shingle phrase queries" from the user query
> and add them all to a BooleanQuery?
>
> "example input query"^10 OR
> "example input"^5 OR
> "input query"^5
>
> SpanNear and PhraseQueries are rather expensive though. Not too long ago
> I replaced phrase queries with shingles in an index containing tens of
> millions of small documents, resulting in queries taking something like
> 1/10 of the time to match, with just as good results, if not better in
> some cases.
>
>    karl
>
>
> On 15 Jul 2008, at 05:35, Preetam Rao wrote:
>
>> Hi Steve,
>>
>> It would be simpler if I had a query called SubPhraseQuery, in which
>> case I would not have to generate extra terms during ingestion or extra
>> queries at query time. As a user, the best I could hope for is to ingest
>> the data from some feed into different fields, run the user query as is
>> through some set of Lucene queries, and get the most relevant results
>> without worrying much about the internals of scoring.
>>
>> In my case, I know that each field will most likely match some
>> sub-phrase of the user query, and I need a query or Solr request handler
>> which handles this case.
>>
>> For cases where I care more about exact or sub-phrase matches and not
>> about tf or idf, I think a SubPhraseQuery with the following parameters
>> would be great:
>> phraseBoostFactor - a factor which tells how much better an n-term match
>> is than an (n-1)-term match.
>> useHighestMatch - picks only the best sub-phrase match as the score.
>> ignoreDuplicates - ignores duplicate sub-phrase matches.
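>>
>> As a sketch, I imagine an API along these lines (SubPhraseQuery does not
>> exist in Lucene today; all names here are illustrative):
>>
>>     // Hypothetical API -- no such class exists yet.
>>     SubPhraseQuery q = new SubPhraseQuery("description", userQueryTerms);
>>     q.setPhraseBoostFactor(2.0f); // n-term match counts 2x an (n-1)-term
>>     q.setUseHighestMatch(true);   // score only the best sub-phrase match
>>     q.setIgnoreDuplicates(true);  // repeated sub-phrase matches count once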
>>
>> Currently I have tried Solr's dismax handler as well as the standard
>> request handler with boosts and other parameters, but out of 3 million
>> documents I have been unable to get the most relevant top 5 results,
>> which is what matters most to me. Trying to understand the scoring and
>> fine-tune it was no help either. To get the most relevant top 10 results,
>> I am willing to resort to some kind of exact-match-based scoring rather
>> than rely on Lucene's scoring formulas.
>>
>> If a query like the above were in place, one could build another Solr
>> request handler, similar to dismax, which uses SubPhraseQueries instead.
>> And for a user who is more interested in exact/sub-phrase matches in some
>> fields and normal boolean matches in others, additional parameters at the
>> handler level would be useful:
>> matchOnlyOneField - matches only one field and does not use matched terms
>> from another field.
>> ignoreTfIdf - on a per-field basis, allows one to ignore some scoring
>> calculations and just use the sub-phrase scores.
>>
>> This handler, combined with the sub-phrase query parameters, could prove
>> very useful for handling user queries as they are, with a lot more
>> flexibility.
>>
>> I believe the same use case exists for many users who have mostly
>> structured data with only some portions of it being free text (like
>> descriptions, reviews, etc.) and who want to handle user-entered queries
>> as is, without resorting to query interpretation or score tuning.
>>
>> Thanks
>> Preetam
>>
>> On Mon, Jul 14, 2008 at 11:33 PM, Steven A Rowe <[EMAIL PROTECTED]> wrote:
>>
>>> Hi Preetam,
>>>
>>> On 07/14/2008 at 1:40 PM, Preetam Rao wrote:
>>>
>>>> Is there a query in Lucene which matches sub-phrases?
>>>>
>>>> [snip]
>>>>
>>>> I was redirected to ShingleFilter, which is a token filter that spits
>>>> out n-grams. But it does not seem to be the best solution, since one
>>>> does not know in advance what n in n-grams should be.
>>>
>>>
>>> You could guess at the useful range, though, and then have
>>> ((max n) - (min n) + 1) fields (e.g., three fields for n = 2, 3, 4),
>>> scaling the boost for each with the corresponding value of n.
>>>
>>> Just using 2-grams could be good enough, since the longer the sub-phrase
>>> match, the more matching 2-grams.
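>>>
>>> For the 2-gram case, the indexing side could be as simple as this
>>> untested sketch, using the contrib ShingleFilter (the tokenizer choice
>>> is arbitrary):
>>>
>>>     import java.io.Reader;
>>>     import org.apache.lucene.analysis.Analyzer;
>>>     import org.apache.lucene.analysis.TokenStream;
>>>     import org.apache.lucene.analysis.WhitespaceTokenizer;
>>>     import org.apache.lucene.analysis.shingle.ShingleFilter;
>>>
>>>     // Index text as 2-grams only, e.g. "quick brown fox" becomes
>>>     // the tokens "quick brown" and "brown fox".
>>>     public class BigramAnalyzer extends Analyzer {
>>>       public TokenStream tokenStream(String fieldName, Reader reader) {
>>>         ShingleFilter bigrams =
>>>             new ShingleFilter(new WhitespaceTokenizer(reader), 2);
>>>         bigrams.setOutputUnigrams(false); // drop single terms
>>>         return bigrams;
>>>       }
>>>     }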
>>>
>>>> Also it means one has to get all these bigrams and then construct a
>>>> boolean OR query, which is not very efficient either.
>>>>
>>>
>>> In terms of your requirements, though, I think you're stuck with this
>>> inefficiency, no matter what solution you end up with; you need to do
>>> some form of term combination in your queries. And the ShingleFilter
>>> approach doesn't compare badly here, since positions for phrase queries
>>> don't have to be looked up during scoring.
>>>
>>> If index space efficiency is a concern, though, the
>>> one-field-per-value-of-n solution I mentioned above could pose a problem.
>>>
>>> Steve
>>>
