[jira] [Updated] (LUCENE-5288) Add ProxBooleanTermQuery, like BooleanQuery but boosting when term occur "close" together (in proximity) in each document

Michael McCandless (JIRA) Fri, 18 Oct 2013 03:27:24 -0700

     [ 
https://issues.apache.org/jira/browse/LUCENE-5288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Michael McCandless updated LUCENE-5288:
---------------------------------------

    Attachment: LUCENE-5288.patch

OK, here's a totally new patch, trying out a different approach, that
does rescoring during collection, and uses Scorer.getChildren to
locate all proximity-aware term queries.

A big advantage of this approach is it's now query agnostic: any Query
can add in prox boosting.  In order to do so, it must use
ProxTermQuery wherever it previously used TermQuery, and then use
RescoringCollector passing ProximityRescorer.

Another big advantage is this approach can short-circuit the prox
boosting by realizing that even after prox boosting a given hit cannot
compete, and avoid visiting any positions for that hit.  I haven't
implemented that yet, but the hooks are there to do so ... this should
typically result in big gains over the dedicated ProxBooleanTermQuery.
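
To make the short-circuit idea concrete, here is a minimal standalone
sketch. It is not code from the patch and uses no Lucene classes: the
class name, the additive boost, and the MAX_PROX_BOOST bound are all
assumptions made for illustration. A hit is skipped, without touching
positions, when its first-pass score plus the largest possible boost
still cannot beat the weakest entry in a full top-N queue.

```java
import java.util.PriorityQueue;

/** Hypothetical sketch; none of these names come from the patch. */
public class ShortCircuitSketch {
    /** Assumed upper bound on the (additive) boost any hit can receive. */
    static final float MAX_PROX_BOOST = 2.0f;

    private final int topN;
    /** Min-heap of final scores; the head is the weakest competitive hit. */
    private final PriorityQueue<Float> queue = new PriorityQueue<>();
    /** Counts how many hits actually paid the cost of visiting positions. */
    int positionsVisited = 0;

    ShortCircuitSketch(int topN) {
        this.topN = topN;
    }

    /** Stand-in for the expensive step that would walk term positions. */
    private float computeProxBoost(float actualBoost) {
        positionsVisited++;
        return actualBoost;
    }

    void collect(float baseScore, float actualBoost) {
        // Short-circuit: even the maximum boost cannot make this hit compete.
        if (queue.size() == topN && baseScore + MAX_PROX_BOOST <= queue.peek()) {
            return;
        }
        float finalScore = baseScore + computeProxBoost(actualBoost);
        if (queue.size() < topN) {
            queue.add(finalScore);
        } else if (finalScore > queue.peek()) {
            queue.poll();
            queue.add(finalScore);
        }
    }

    float minCompetitiveScore() {
        return queue.peek();
    }
}
```

The tighter the bound on the boost, the more hits can be rejected
before any positions are read.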

However, one big downside of this approach is, because it's more
divorced from how the actual query did its scoring, it can be much
slower in some cases because for every hit it must visit ALL terms to
see if they are "on" the current doc, so there's now an automatic
O(N) cost per collect, where N = number of prox terms.
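
The per-hit cost can be pictured with a small standalone sketch (again
not from the patch; TermDocs and termsOnDoc are hypothetical names,
and plain sorted int arrays stand in for postings). For every
collected doc, the loop touches all N prox terms just to learn which
of them are positioned on that doc.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the O(N) scan done on every collect. */
public class PerHitTermScanSketch {
    /** Stand-in for one prox term's postings: sorted doc IDs. */
    static final class TermDocs {
        final int[] docs;
        int idx = 0;

        TermDocs(int... docs) {
            this.docs = docs;
        }

        /** Advances to the first doc >= target; true if it equals target. */
        boolean advanceTo(int target) {
            while (idx < docs.length && docs[idx] < target) {
                idx++;
            }
            return idx < docs.length && docs[idx] == target;
        }
    }

    /** Returns the indexes of the terms that are "on" doc. */
    static List<Integer> termsOnDoc(TermDocs[] terms, int doc) {
        List<Integer> on = new ArrayList<>();
        // Every term is checked, even those that do not match this doc.
        for (int i = 0; i < terms.length; i++) {
            if (terms[i].advanceTo(doc)) {
                on.add(i);
            }
        }
        return on;
    }
}
```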

Another downside is this approach has a bigger API surface area, but
another possible advantage is maybe we can share APIs with the
"collect top N and resort TopDocs" use case.

Not sure which is better...


> Add ProxBooleanTermQuery, like BooleanQuery but boosting when term occur 
> "close" together (in proximity) in each document
> -------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-5288
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5288
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 4.6, 5.0
>
>         Attachments: LUCENE-5288.patch, LUCENE-5288.patch, LUCENE-5288.patch
>
>
> This is very much a work in progress, tons of nocommits...  It adds two 
> classes:
>   * ProxBooleanTermQuery: like BooleanQuery (currently, all clauses
>     must be TermQuery, and only Occur.SHOULD is supported), which is
>     essentially a BooleanQuery (same matching/scoring) except that for
>     each matching doc the positions are merge-sorted and scored to
>     "boost" the document's score
>   * QueryRescorer: simple API to re-score top hits using a different
>     query.  Because ProxBooleanTermQuery is so costly, apps would
>     normally run an "ordinary" BooleanQuery across the full index, to
>     get the top few hundred hits, and then rescore using the more
>     costly ProxBooleanTermQuery (or other costly queries).
> I'm not sure how to actually compute the appropriate prox boost (this
> is the hard part!!) and I've completely punted on that in the current
> patch (it's just a hack now), but the patch does all the "mechanics"
> to merge/visit all the positions in order per hit.
> Maybe we could do scoring similar to what SpanNearQuery or a sloppy
> PhraseQuery would do, or maybe this paper:
>   http://plg.uwaterloo.ca/~claclark/sigir2006_term_proximity.pdf
> which Rob also used in LUCENE-4909 to add proximity scoring to
> PostingsHighlighter.  Maybe we need to make it (how the prox boost is
> computed/folded in) somehow pluggable ...
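
As an illustration of the "mechanics" described in the quoted text,
the following standalone sketch merge-sorts per-term position lists
for one document and folds them into a single boost. The description
leaves the boost formula open, so the inverse-square-distance sum over
adjacent occurrences of different terms is only an assumption chosen
for the example, and ProxMergeSketch and proxBoost are hypothetical
names, not part of the patch.

```java
import java.util.PriorityQueue;

/** Hypothetical sketch: merge positions in order, then score them. */
public class ProxMergeSketch {
    /**
     * positionsPerTerm[t] holds the sorted positions of term t in one doc.
     * Returns the sum of 1/d^2 over adjacent occurrences of different
     * terms (an illustrative formula, not the patch's).
     */
    static double proxBoost(int[][] positionsPerTerm) {
        // Heap entries are {position, termIndex, offsetIntoTermList}.
        PriorityQueue<int[]> heap =
            new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int t = 0; t < positionsPerTerm.length; t++) {
            if (positionsPerTerm[t].length > 0) {
                heap.add(new int[] {positionsPerTerm[t][0], t, 0});
            }
        }
        double boost = 0.0;
        int prevPos = -1;
        int prevTerm = -1;
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            int pos = top[0];
            int term = top[1];
            int next = top[2] + 1;
            // Only neighbouring occurrences of different terms contribute;
            // equal positions are skipped to avoid dividing by zero.
            if (prevTerm != -1 && prevTerm != term && pos > prevPos) {
                double d = pos - prevPos;
                boost += 1.0 / (d * d);
            }
            prevPos = pos;
            prevTerm = term;
            if (next < positionsPerTerm[term].length) {
                heap.add(new int[] {positionsPerTerm[term][next], term, next});
            }
        }
        return boost;
    }
}
```

Keeping the merge separate from the formula is one way the boost
computation could be made pluggable, as the description suggests.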



--
This message was sent by Atlassian JIRA
(v6.1#6144)
