[jira] [Commented] (LUCENE-5288) Add ProxBooleanTermQuery, like BooleanQuery but boosting when term occur "close" together (in proximity) in each document

Michael McCandless (JIRA) Wed, 16 Oct 2013 11:10:07 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-5288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13797072#comment-13797072
 ]


Michael McCandless commented on LUCENE-5288:
--------------------------------------------

{quote}

bq. Maybe we could do the similar scoring that SpanNearQuery or sloppy 
PhraseQuery would do

Please do not do this. If someone wants that, then they can simply use those 
queries. They already exist.
{quote}

Honestly I have no idea how to properly "boost" for proximity; this is
all one big nocommit now.

Matching how SpanNear and sloppy PhraseQuery score would at least be
"familiar" (though I don't think it'd match exactly).  I think this
impl should be faster than using N SpanNearQuery though ... have to
test.

bq. I do not think we should be adding more scorers that score in the crazy 
ways these queries do: it's 2013.

OK fair enough.  It was just an idea... I just don't know what to do
with the scoring side of this.

bq. Can we avoid the 'disableCoord'? I don't think we should add negatives 
like omitXXX or disableXXX anymore.

You mean name it enableCoord instead?  Or don't even expose the
option?

bq. Why does the one in BQ need to be protected?

Hmm, it doesn't, I'll fix.  At one point I had tried subclassing BQ...

bq. Why is the heap modification code manually inlined into proxscorer? Is this 
for some performance gain? If so, what is the improvement?

No good reason... I'll add a nocommit to subclass PQ "properly".

bq. Why does proxscorer score as S + P (where P is some proximity boost) if 
QueryRescorer is already scoring as S + (W*S2)?

The S+P is a total hack now; I have no idea what the "right" way is.

bq. This seems to make the entire query unnecessary: its booleanness is not 
needed, in fact, not wanted as it will just double-count normal scoring. and we 
just need something more like a phrase query with different scoring?

Well, I think we should leave open the option to run this query on the
entire index?  (And to also be able to use it only for rescoring
topN).

I agree the double-boolean-scoring is messed up ... not sure how to
fix.

Maybe in QueryRescorer we need to change the formula to W1*S1 + W2*S2,
then you could set W1=0... not sure ...

But that is a neat idea to do rescoring with a PhraseQuery... or with
SpanNearQuery, etc.  Maybe we should break out QueryRescorer into its
own issue?
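
The W1*S1 + W2*S2 idea above can be sketched in plain Java, without any
Lucene dependency. This is only an illustration of the arithmetic, not the
code in the patch: the class WeightedRescoreSketch, the Hit holder and the
SecondPassScorer interface are all hypothetical names made up for this
example. Setting W1=0 makes the final order depend only on the second-pass
score, which is one way to avoid double-counting the boolean score.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: none of these names come from the LUCENE-5288 patch.
class WeightedRescoreSketch {

    /** A doc id plus a score; a stand-in for a first-pass search hit. */
    static final class Hit {
        final int doc;
        final double score;

        Hit(int doc, double score) {
            this.doc = doc;
            this.score = score;
        }
    }

    /** Assumed second-pass scorer, e.g. a proximity or phrase score for a doc. */
    interface SecondPassScorer {
        double score(int doc);
    }

    /** The proposed formula: W1*S1 + W2*S2. */
    static double combine(double w1, double s1, double w2, double s2) {
        return w1 * s1 + w2 * s2;
    }

    /** Rescores the first-pass hits and returns them sorted by combined score, best first. */
    static List<Hit> rescore(List<Hit> firstPass, SecondPassScorer second,
                             double w1, double w2) {
        List<Hit> out = new ArrayList<>();
        for (Hit h : firstPass) {
            out.add(new Hit(h.doc, combine(w1, h.score, w2, second.score(h.doc))));
        }
        // Descending by score; ties broken by ascending doc id.
        out.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed()
                .thenComparingInt(h -> h.doc));
        return out;
    }
}
```

With w1=1 this reduces to the current S + (W*S2) form; with w1=0 only the
second query's score decides the order of the top hits.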


> Add ProxBooleanTermQuery, like BooleanQuery but boosting when term occur 
> "close" together (in proximity) in each document
> -------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-5288
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5288
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 4.6, 5.0
>
>         Attachments: LUCENE-5288.patch
>
>
> This is very much a work in progress, tons of nocommits...  It adds two 
> classes:
>   * ProxBooleanTermQuery: like BooleanQuery (currently, all clauses
>     must be TermQuery, and only Occur.SHOULD is supported), which is
>     essentially a BooleanQuery (same matching/scoring) except that for
>     each matching doc the positions are merge-sorted and scored to
>     "boost" the document's score
>   * QueryRescorer: simple API to re-score top hits using a different
>     query.  Because ProxBooleanTermQuery is so costly, apps would
>     normally run an "ordinary" BooleanQuery across the full index, to
>     get the top few hundred hits, and then rescore using the more
>     costly ProxBooleanTermQuery (or other costly queries).
> I'm not sure how to actually compute the appropriate prox boost (this
> is the hard part!!) and I've completely punted on that in the current
> patch (it's just a hack now), but the patch does all the "mechanics"
> to merge/visit all the positions in order per hit.
> Maybe we could do the similar scoring that SpanNearQuery or sloppy
> PhraseQuery would do, or maybe this paper:
>   http://plg.uwaterloo.ca/~claclark/sigir2006_term_proximity.pdf
> which Rob also used in LUCENE-4909 to add proximity scoring to
> PostingsHighlighter.  Maybe we need to make it (how the prox boost is
> computed/folded in) somehow pluggable ...
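
The "mechanics" described above (visiting all positions of all terms in
order for one hit) can be sketched in plain Java with a priority queue over
per-term cursors. This is a rough illustration and not the patch's code:
PositionMergeSketch, Occurrence and Cursor are hypothetical names, positions
are passed in as plain int arrays rather than read from postings, and
toyProximityBoost is a made-up placeholder (sum of 1/gap between adjacent
occurrences of different terms), not a proposal for the real scoring.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: none of these names come from the LUCENE-5288 patch.
class PositionMergeSketch {

    /** One occurrence of a query term (by index) at a position in the doc. */
    static final class Occurrence {
        final int termIndex;
        final int position;

        Occurrence(int termIndex, int position) {
            this.termIndex = termIndex;
            this.position = position;
        }
    }

    /** Cursor over one term's position list, which is assumed sorted ascending. */
    private static final class Cursor {
        final int termIndex;
        final int[] positions;
        int next;

        Cursor(int termIndex, int[] positions) {
            this.termIndex = termIndex;
            this.positions = positions;
        }

        int current() {
            return positions[next];
        }
    }

    /** Merge-sorts the per-term position lists into one stream ordered by position. */
    static List<Occurrence> mergePositions(int[][] positionsPerTerm) {
        PriorityQueue<Cursor> pq = new PriorityQueue<Cursor>(
                (a, b) -> Integer.compare(a.current(), b.current()));
        for (int t = 0; t < positionsPerTerm.length; t++) {
            if (positionsPerTerm[t].length > 0) {
                pq.add(new Cursor(t, positionsPerTerm[t]));
            }
        }
        List<Occurrence> merged = new ArrayList<>();
        while (!pq.isEmpty()) {
            Cursor c = pq.poll();
            merged.add(new Occurrence(c.termIndex, c.current()));
            c.next++;
            if (c.next < c.positions.length) {
                pq.add(c); // re-insert with its next position as the key
            }
        }
        return merged;
    }

    /** Placeholder boost (an assumption): 1/gap for adjacent occurrences of different terms. */
    static double toyProximityBoost(List<Occurrence> merged) {
        double boost = 0.0;
        for (int i = 1; i < merged.size(); i++) {
            Occurrence prev = merged.get(i - 1);
            Occurrence cur = merged.get(i);
            if (prev.termIndex != cur.termIndex) {
                int gap = cur.position - prev.position;
                boost += 1.0 / Math.max(1, gap);
            }
        }
        return boost;
    }
}
```

Keeping the boost in its own method mirrors the "pluggable" idea: the merge
step stays fixed while the function that turns ordered positions into a
boost can be swapped.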


