[jira] [Updated] (LUCENE-5288) Add ProxBooleanTermQuery, like BooleanQuery but boosting when term occur "close" together (in proximity) in each document

Michael McCandless (JIRA) Fri, 18 Oct 2013 12:44:43 -0700

     [ https://issues.apache.org/jira/browse/LUCENE-5288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Michael McCandless updated LUCENE-5288:
---------------------------------------

    Attachment: LUCENE-5288.patch

bq. I think it would be far better to start with an API that takes Term[] 
optionally, but can compute it itself with Query.extractTerms(LinkedHashSet), 
which will work in many cases.

OK, here's a new patch that uses Query.extractTerms to get the terms, and then 
it creates separate DocsAndPositionsEnums.
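
As a rough illustration of the "optional Term[], else extractTerms" idea, here 
is a minimal self-contained sketch. The types below (SketchQuery, 
SketchTermQuery, SketchBooleanQuery, ProxTerms) are hypothetical stand-ins 
invented for this example, not Lucene classes and not code from the patch; 
terms are modeled as plain strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-ins for query types; NOT the real Lucene API.
interface SketchQuery {
    // Adds every term this query uses to the given set.
    void extractTerms(Set<String> terms);
}

final class SketchTermQuery implements SketchQuery {
    private final String term;
    SketchTermQuery(String term) { this.term = term; }
    @Override public void extractTerms(Set<String> terms) { terms.add(term); }
}

final class SketchBooleanQuery implements SketchQuery {
    private final List<SketchQuery> clauses = new ArrayList<>();
    SketchBooleanQuery add(SketchQuery q) { clauses.add(q); return this; }
    @Override public void extractTerms(Set<String> terms) {
        for (SketchQuery q : clauses) {
            q.extractTerms(terms);
        }
    }
}

final class ProxTerms {
    /** Uses the caller's terms if supplied, otherwise asks the query for them. */
    static List<String> resolve(SketchQuery query, List<String> explicitTerms) {
        if (explicitTerms != null) {
            return new ArrayList<>(explicitTerms);
        }
        // LinkedHashSet drops duplicates while keeping first-seen order.
        Set<String> terms = new LinkedHashSet<>();
        query.extractTerms(terms);
        return new ArrayList<>(terms);
    }
}
```

In this sketch a term that appears in several clauses is returned once, in the 
order it was first seen, which is presumably why a LinkedHashSet was suggested.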

I don't like that this means we pull two enums per term (one for the 
TermQuery and one for the prox scoring), unlike the previous patches, but 
hopefully the cost turns out to be in the noise once we work out the 
short-circuit optimization.

Tons of nocommits still.

> Add ProxBooleanTermQuery, like BooleanQuery but boosting when term occur 
> "close" together (in proximity) in each document
> -------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-5288
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5288
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 4.6, 5.0
>
>         Attachments: LUCENE-5288.patch, LUCENE-5288.patch, LUCENE-5288.patch, 
> LUCENE-5288.patch
>
>
> This is very much a work in progress, tons of nocommits...  It adds two 
> classes:
>   * ProxBooleanTermQuery: essentially a BooleanQuery (same
>     matching/scoring; currently all clauses must be TermQuery, and
>     only Occur.SHOULD is supported), except that for each matching
>     doc the positions are merge-sorted and scored to "boost" the
>     document's score
>   * QueryRescorer: simple API to re-score top hits using a different
>     query.  Because ProxBooleanTermQuery is so costly, apps would
>     normally run an "ordinary" BooleanQuery across the full index, to
>     get the top few hundred hits, and then rescore using the more
>     costly ProxBooleanTermQuery (or other costly queries).
> I'm not sure how to actually compute the appropriate prox boost (this
> is the hard part!!) and I've completely punted on that in the current
> patch (it's just a hack now), but the patch does all the "mechanics"
> to merge/visit all the positions in order per hit.
> Maybe we could do scoring similar to what SpanNearQuery or sloppy
> PhraseQuery does, or maybe follow this paper:
>   http://plg.uwaterloo.ca/~claclark/sigir2006_term_proximity.pdf
> which Rob also used in LUCENE-4909 to add proximity scoring to
> PostingsHighlighter.  Maybe we need to make it (how the prox boost is
> computed/folded in) somehow pluggable ...
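
The "mechanics" described in the issue (visiting all positions of all terms 
in order for one hit) can be sketched with a priority queue of per-term 
cursors. This is a minimal sketch under stated assumptions, not the patch's 
code: ProxMergeSketch and its methods are hypothetical names, positions are 
given as plain int arrays rather than pulled from DocsAndPositionsEnum, and 
the boost formula (sum of 1/d^2 over neighboring positions of different 
terms) is only a placeholder chosen for illustration, since the issue leaves 
the actual prox boost open.

```java
import java.util.PriorityQueue;

final class ProxMergeSketch {
    /** Cursor over one term's ascending positions within a single document. */
    private static final class Cursor {
        final int termId;
        final int[] positions;
        int idx;
        Cursor(int termId, int[] positions) {
            this.termId = termId;
            this.positions = positions;
        }
        int pos() { return positions[idx]; }
    }

    /**
     * positionsPerTerm[i] holds the ascending positions of term i in one doc.
     * Returns a placeholder boost (an assumption, not the patch's formula):
     * the sum of 1/d^2 over consecutive merged positions that belong to
     * different terms, where d is the distance between them.
     */
    static double proxBoost(int[][] positionsPerTerm) {
        // Order cursors by their current position: a k-way merge.
        PriorityQueue<Cursor> pq =
            new PriorityQueue<>((a, b) -> Integer.compare(a.pos(), b.pos()));
        for (int t = 0; t < positionsPerTerm.length; t++) {
            if (positionsPerTerm[t].length > 0) {
                pq.add(new Cursor(t, positionsPerTerm[t]));
            }
        }
        double boost = 0.0;
        int prevTerm = -1;
        int prevPos = -1;
        while (!pq.isEmpty()) {
            Cursor top = pq.poll();
            int pos = top.pos();
            if (prevTerm != -1 && prevTerm != top.termId) {
                // Treat terms at the same position as distance 1.
                int d = Math.max(1, pos - prevPos);
                boost += 1.0 / ((double) d * d);
            }
            prevTerm = top.termId;
            prevPos = pos;
            // Advance this term's cursor and re-insert if it has more positions.
            if (++top.idx < top.positions.length) {
                pq.add(top);
            }
        }
        return boost;
    }
}
```

With this placeholder, two different terms at adjacent positions contribute 
1.0, while the same two terms 100 positions apart contribute 0.0001, and a 
document matching only one term gets no boost at all.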

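The two-pass flow that QueryRescorer is meant to support (cheap query over 
the full index, then a costly query over only the top hits) might look 
roughly like the following. Everything here is hypothetical: RescoreSketch, 
Hit, and the way the two scores are combined (first-pass score plus a 
weighted second-pass score) are assumptions made for this example, not the 
API in the patch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.IntToDoubleFunction;

final class RescoreSketch {
    /** A (doc, score) pair; hypothetical stand-in for a search hit. */
    static final class Hit {
        final int doc;
        final double score;
        Hit(int doc, double score) {
            this.doc = doc;
            this.score = score;
        }
    }

    /**
     * Rescores the first topN hits of firstPass (assumed already sorted by
     * descending first-pass score) and returns them re-sorted.
     * The combination rule is an assumption: first + weight * costly.
     */
    static List<Hit> rescore(List<Hit> firstPass, int topN,
                             IntToDoubleFunction costlyScore, double weight) {
        List<Hit> out = new ArrayList<>();
        int n = Math.min(topN, firstPass.size());
        for (Hit h : firstPass.subList(0, n)) {
            // The costly scorer runs only for the top hits, not the full index.
            double combined = h.score + weight * costlyScore.applyAsDouble(h.doc);
            out.add(new Hit(h.doc, combined));
        }
        // Highest combined score first; break ties by doc id.
        out.sort(Comparator.comparingDouble((Hit h) -> h.score)
                           .reversed()
                           .thenComparingInt(h -> h.doc));
        return out;
    }
}
```

The point of the design is that the costly scorer is invoked at most topN 
times per search, regardless of how many documents the first pass matched.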


