[ 
https://issues.apache.org/jira/browse/LUCENE-8123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315285#comment-16315285
 ] 

Wenhai commented on LUCENE-8123:
--------------------------------

Got it, thanks.




> Question about how to retrieve by TFIDFSimilarity query on lucene
> -----------------------------------------------------------------
>
>                 Key: LUCENE-8123
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8123
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/query/scoring
>    Affects Versions: 7.2
>            Reporter: Wenhai
>            Priority: Minor
>
> Hi all,
>      Recently, we have been running experiments on Lucene based on TF-IDF.
>      We want to retrieve the similar documents from the corpus, i.e., those 
> documents (d) whose similarity to the given query (q) is no less than 
> a threshold. We use the following scoring function:
>     sum(tf(t,d) * idf(t) * tf(t,q) * idf(t))/(norm(d) * norm(q))
>     where norm is defined as sqrt( sum(tf(t,d) * idf(t) * tf(t,d) * idf(t)) ).
>     We perform this query by scanning the related docIds of all terms in the 
> query; the related docIds are obtained from the call PostingsEnum docEnum 
> = MultiFields.getTermDocsEnum(indexReader, "text", term.bytes()). After the 
> inner products of these related documents have been computed, the final 
> similarities are obtained by dividing these inner products by the norms.
>     However, when the corpus scales up, e.g., to more than ten million titles 
> from Twitter's text field, each with 10 terms on average, the runtime is 
> unacceptable (more than ten seconds), since we always need to merge 0.5~2 
> million documents to generate the inner products. Does Lucene provide a more 
> efficient interface to generate ranked results based on TF-IDF, or to directly 
> filter out the dissimilar documents (in Lucene core) for a given threshold in 
> the range (0, 1)?
> Best,
> Wenhai
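
For reference, the scoring function and the scan-then-normalize procedure quoted above can be sketched in plain Java roughly as follows. This is only an illustration under stated assumptions: it uses an in-memory map instead of a Lucene index, and the class TfIdfCosineSketch, the Posting type, the method similarAbove, and the precomputed docNorm array are all hypothetical names introduced here, not part of the Lucene API.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of cosine TF-IDF scoring over an in-memory inverted index (not Lucene API). */
final class TfIdfCosineSketch {

    /** One posting: a document id and the frequency of the term in that document. */
    static final class Posting {
        final int docId;
        final int tf;
        Posting(int docId, int tf) { this.docId = docId; this.tf = tf; }
    }

    /**
     * Accumulates sum(tf(t,d) * idf(t) * tf(t,q) * idf(t)) per document, term by term,
     * then divides by norm(d) * norm(q) and keeps documents whose similarity is >= threshold.
     *
     * Assumptions: docNorm[docId] already holds norm(d) as defined in the quoted message,
     * and terms missing from the idf map get weight 0.
     */
    static Map<Integer, Double> similarAbove(
            Map<String, List<Posting>> index,   // term -> postings list
            Map<String, Double> idf,            // term -> idf(t)
            double[] docNorm,                   // docId -> norm(d)
            Map<String, Integer> queryTf,       // term -> tf(t,q)
            double threshold) {

        Map<Integer, Double> dot = new HashMap<>();
        double queryNormSq = 0.0;

        for (Map.Entry<String, Integer> e : queryTf.entrySet()) {
            double idfT = idf.getOrDefault(e.getKey(), 0.0);
            double qWeight = e.getValue() * idfT;          // tf(t,q) * idf(t)
            queryNormSq += qWeight * qWeight;
            List<Posting> postings = index.getOrDefault(e.getKey(), Collections.<Posting>emptyList());
            for (Posting p : postings) {
                // Add tf(t,d) * idf(t) * tf(t,q) * idf(t) to this document's accumulator.
                dot.merge(p.docId, p.tf * idfT * qWeight, Double::sum);
            }
        }

        double queryNorm = Math.sqrt(queryNormSq);
        Map<Integer, Double> result = new HashMap<>();
        for (Map.Entry<Integer, Double> e : dot.entrySet()) {
            double denom = docNorm[e.getKey()] * queryNorm;
            if (denom == 0.0) {
                continue;                                  // skip degenerate norms
            }
            double sim = e.getValue() / denom;
            if (sim >= threshold) {
                result.put(e.getKey(), sim);
            }
        }
        return result;
    }
}
```

The sketch visits every posting of every query term once and keeps one accumulator per candidate document, so it reproduces the merge cost described in the quoted message rather than reducing it.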



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
