[ 
https://issues.apache.org/jira/browse/LUCENE-8123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenhai updated LUCENE-8123:
---------------------------
    Description: 
Hi, all.
     Recently, we have been running experiments on Lucene based on TF-IDF.
     We want to retrieve from the corpus the documents whose similarity to a 
given query (q) is no less than a threshold. For each document (d), we use the 
following scoring function:
    sum(tf(t,d) * idf(t) * tf(t,q) * idf(t))/(norm(d) * norm(q))
    where norm(d) is defined as sqrt( sum(tf(t,d) * idf(t) * tf(t,d) * idf(t)) ),
    and norm(q) is defined likewise.

    We run this query by scanning the postings of every term in the query; 
the matching docIds come from  PostingsEnum docEnum = 
MultiFields.getTermDocsEnum(indexReader, "text", term.bytes()) . Once the 
inner products for these documents have been accumulated, the final 
similarities are obtained by dividing each inner product by the norms.
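
    The procedure described above can be sketched in plain Java as follows. 
This is only an illustration of the term-at-a-time accumulation, not the 
reporter's actual code: the class TfIdfCosineSketch, the Posting type, and the 
in-memory maps are hypothetical stand-ins for the per-term PostingsEnum, the 
idf values, and norms assumed to be precomputed for every document.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: in-memory maps stand in for the Lucene index.
final class TfIdfCosineSketch {

    /** One posting: a document id and the term frequency tf(t,d). */
    static final class Posting {
        final int docId;
        final int tf;
        Posting(int docId, int tf) { this.docId = docId; this.tf = tf; }
    }

    /**
     * Returns docId -> cosine similarity for every document whose
     * similarity to the query is no less than the threshold.
     * Assumes docNorms holds a precomputed norm(d) for each document.
     */
    static Map<Integer, Double> similarDocs(
            Map<String, List<Posting>> postings,   // term -> posting list
            Map<String, Double> idf,               // term -> idf(t)
            Map<Integer, Double> docNorms,         // docId -> norm(d)
            Map<String, Integer> queryTf,          // term -> tf(t,q)
            double threshold) {

        // norm(q) = sqrt(sum over query terms of (tf(t,q) * idf(t))^2)
        double queryNormSq = 0.0;
        for (Map.Entry<String, Integer> e : queryTf.entrySet()) {
            double w = e.getValue() * idf.getOrDefault(e.getKey(), 0.0);
            queryNormSq += w * w;
        }
        double queryNorm = Math.sqrt(queryNormSq);

        // Term-at-a-time: walk each query term's postings and add
        // tf(t,d) * idf(t) * tf(t,q) * idf(t) to that document's accumulator.
        Map<Integer, Double> dot = new HashMap<>();
        for (Map.Entry<String, Integer> e : queryTf.entrySet()) {
            double termIdf = idf.getOrDefault(e.getKey(), 0.0);
            double queryWeight = e.getValue() * termIdf;
            for (Posting p : postings.getOrDefault(e.getKey(), List.of())) {
                dot.merge(p.docId, p.tf * termIdf * queryWeight, Double::sum);
            }
        }

        // Divide each inner product by norm(d) * norm(q) and apply the threshold.
        Map<Integer, Double> result = new HashMap<>();
        if (queryNorm == 0.0) {
            return result;
        }
        for (Map.Entry<Integer, Double> e : dot.entrySet()) {
            double docNorm = docNorms.getOrDefault(e.getKey(), 0.0);
            if (docNorm == 0.0) {
                continue; // skip documents without a usable norm
            }
            double sim = e.getValue() / (docNorm * queryNorm);
            if (sim >= threshold) {
                result.put(e.getKey(), sim);
            }
        }
        return result;
    }
}
```

    The accumulator map grows with the number of distinct documents touched by 
any query term, which is where the cost of merging postings shows up in this 
approach.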

    However, when the corpus scales up, e.g., to more than ten million titles 
from Twitter's text field, each with 10 terms on average, the runtime is 
unacceptable (more than ten seconds), since we always need to merge 0.5~2 
million documents to generate the inner products. Does Lucene provide a more 
efficient interface for generating ranked results based on TF-IDF?

Best
Wenhai 

  was:
Hi, all.
     Recently, we have been running experiments on Lucene based on TF-IDF.
     We want to retrieve from the corpus the documents whose similarity to a 
given query (q) is no less than a threshold. For each document (d), we use the 
following scoring function:
    sum(tf(t,d) * idf(t) * tf(t,q) * idf(t))/(norm(d) * norm(q))
    where norm(d) is defined as sqrt( sum(tf(t,d) * idf(t) * tf(t,d) * idf(t)) ),
    and norm(q) is defined likewise.

    We run this query by scanning the postings of every term in the query; 
the matching docIds come from  PostingsEnum docEnum = 
MultiFields.getTermDocsEnum(indexReader, "text", term.bytes()) . Once the 
inner products for these documents have been accumulated, the final 
similarities are obtained by dividing each inner product by the norms.

    However, when the corpus scales up, e.g., to more than ten million 
documents, the runtime is unacceptable (more than ten seconds), since we always 
need to merge 0.5~2 million documents to generate the inner products. Does 
Lucene provide a more efficient interface for generating ranked results based 
on TF-IDF?

Best
Wenhai 


> Question about how to retrieve by TFIDFSimilarity query on lucene
> -----------------------------------------------------------------
>
>                 Key: LUCENE-8123
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8123
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/query/scoring
>    Affects Versions: 7.2
>            Reporter: Wenhai
>            Priority: Minor
>
> Hi, all.
>      Recently, we have been running experiments on Lucene based on TF-IDF.
>      We want to retrieve from the corpus the documents whose similarity to a 
> given query (q) is no less than a threshold. For each document (d), we use 
> the following scoring function:
>     sum(tf(t,d) * idf(t) * tf(t,q) * idf(t))/(norm(d) * norm(q))
>     where norm(d) is defined as sqrt( sum(tf(t,d) * idf(t) * tf(t,d) * idf(t)) ),
>     and norm(q) is defined likewise.
>     We run this query by scanning the postings of every term in the query; 
> the matching docIds come from  PostingsEnum docEnum 
> = MultiFields.getTermDocsEnum(indexReader, "text", term.bytes()) . Once the 
> inner products for these documents have been accumulated, the final 
> similarities are obtained by dividing each inner product by the norms.
>     However, when the corpus scales up, e.g., to more than ten million titles 
> from Twitter's text field, each with 10 terms on average, the runtime is 
> unacceptable (more than ten seconds), since we always need to merge 0.5~2 
> million documents to generate the inner products. Does Lucene provide a more 
> efficient interface for generating ranked results based on TF-IDF?
> Best
> Wenhai 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

