Hi all,
   Recently we have been running experiments on Lucene based on TF-IDF.
   We want to retrieve from the corpus all documents d whose similarity to a 
given query q is no less than a threshold. We use the following scoring 
function:

   sum_t( tf(t,d) * idf(t) * tf(t,q) * idf(t) ) / ( norm(d) * norm(q) ),

   where norm(d) is defined as sqrt( sum_t( (tf(t,d) * idf(t))^2 ) ), and 
norm(q) analogously.

  We evaluate this query by scanning the postings (docIds) of every term in 
the query, obtained via PostingsEnum docEnum = 
MultiFields.getTermDocsEnum(indexReader, "text", term.bytes()). After the 
inner products for these candidate documents have been accumulated, the final 
similarities are computed by dividing each inner product by the two norms.
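In case it clarifies the question, the scan described above can be sketched roughly as follows. This is a simplified, self-contained illustration (postings are modeled as plain term-to-(docId, tf) lists rather than a real Lucene PostingsEnum, and the class, method, and parameter names are ours, not Lucene's):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified sketch of the postings-scan approach: accumulate the TF-IDF
// inner product of the query against every document sharing a term, then
// normalize by the document and query norms to get cosine similarity.
public class CosineScan {
    // Stand-in for one postings entry; in Lucene this data would come from
    // MultiFields.getTermDocsEnum / PostingsEnum.
    record Posting(int docId, int tf) {}

    static Map<Integer, Double> score(Map<String, List<Posting>> postings,
                                      Map<String, Integer> queryTf,
                                      Map<String, Double> idf,
                                      Map<Integer, Double> docNorm) {
        Map<Integer, Double> dot = new HashMap<>();
        double queryNormSq = 0.0;
        for (Map.Entry<String, Integer> e : queryTf.entrySet()) {
            String term = e.getKey();
            double qw = e.getValue() * idf.getOrDefault(term, 0.0);
            queryNormSq += qw * qw;
            // Merge this term's postings into the per-document dot products.
            for (Posting p : postings.getOrDefault(term, List.of())) {
                double dw = p.tf() * idf.get(term);
                dot.merge(p.docId(), qw * dw, Double::sum);
            }
        }
        double queryNorm = Math.sqrt(queryNormSq);
        // Divide each accumulated inner product by norm(d) * norm(q).
        Map<Integer, Double> sim = new HashMap<>();
        for (Map.Entry<Integer, Double> e : dot.entrySet()) {
            sim.put(e.getKey(),
                    e.getValue() / (docNorm.get(e.getKey()) * queryNorm));
        }
        return sim;
    }
}
```

The cost that hurts us is exactly the inner merge loop: it touches every posting of every query term before any normalization or thresholding can happen.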

   However, when the collection scales up, e.g., to more than ten million 
Twitter titles whose text field averages 10 terms each, the runtime becomes 
unacceptable (more than ten seconds), since we always need to merge 0.5~2 
million documents to accumulate the inner products. Does Lucene provide a more 
efficient interface to generate ranked results based on TF-IDF, or to directly 
filter out dissimilar documents (in Lucene core) given a threshold in the 
range (0, 1)?

Best,
Wenhai 
