[ 
https://issues.apache.org/jira/browse/LUCENE-8340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16602299#comment-16602299
 ] 

Adrien Grand commented on LUCENE-8340:
--------------------------------------

I went back to this patch and did some testing with the wikimedium10m dataset 
and the following query (note that I had to hack the indexing so that 
"lastModNDV" is also indexed as a LongPoint):
{code:java}
Query boostedQ = new BooleanQuery.Builder()
                .add(new TermQuery(new Term("body", "ref")), Occur.MUST)
                .add(LongPoint.newDistanceFeatureQuery("lastModNDV", 1f,
                        1335997132000L, 24 * 3600 * 1000), Occur.SHOULD) // within 1 day
                .build();
{code}
The maximum score of the term query is 2.07. The maximum score of the distance 
query is 1, and there are 582,764 documents whose timestamp is in 
[1335997132000L - 24 * 3600 * 1000, 1335997132000L + 24 * 3600 * 1000], meaning 
their distance score is in [0.5, 1].
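
To illustrate where the [0.5, 1] range comes from, here is a minimal sketch. It 
assumes the distance score is computed as 
weight * pivotDistance / (pivotDistance + |value - origin|), which is consistent 
with the numbers above but is not copied from the patch. The class and method 
names are hypothetical and exist only for this example.

```java
// Illustrative sketch only; this is not the code from the LUCENE-8340 patch.
final class DistanceFeatureScoreSketch {

    /**
     * Assumed formula: weight * pivotDistance / (pivotDistance + |value - origin|).
     * The score equals weight at the origin and weight / 2 at the pivot distance.
     */
    static double score(float weight, long origin, long pivotDistance, long value) {
        // Compute the distance in double arithmetic to avoid long overflow.
        double distance = Math.abs((double) value - (double) origin);
        return weight * (pivotDistance / (pivotDistance + distance));
    }

    public static void main(String[] args) {
        long origin = 1335997132000L;
        long pivot = 24L * 3600 * 1000; // one day in milliseconds

        System.out.println(score(1f, origin, pivot, origin));             // at the origin
        System.out.println(score(1f, origin, pivot, origin - pivot));     // one day before
        System.out.println(score(1f, origin, pivot, origin + 3 * pivot)); // three days after
    }
}
```

Under this assumption, any document within one day of the origin scores at 
least half of the weight, and the score keeps decaying towards 0 as the 
distance grows.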

When computing the top 10 matches and counting hits, all 3,793,973 hits must be 
visited and points are never read. This takes about 99 ms.
When computing the top 10 matches without counting hits (totalHitsThreshold=1), 
only 264,802 hits are collected (7% of matches) and the query runs in 29 ms, 
about 3.4x faster.

If I switch to more costly queries with fewer hits, the speedup shrinks or, 
unfortunately, even turns into a slowdown. That said, I don't think this should 
prevent us from adding the feature, which is a useful addition to the scoring 
toolbox.

> Allow to boost by recency
> -------------------------
>
>                 Key: LUCENE-8340
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8340
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>         Attachments: LUCENE-8340.patch
>
>
> I would like us to support something like 
> {{FeatureField.newSaturationQuery}} that works with features computed 
> dynamically, such as recency or geo-distance, and that is still optimized for 
> top-hits collection. I'm starting with recency because it makes things a bit 
> easier, even though I suspect that geo-distance might be a more common need.


