Following up on my last question, I am now intrigued by the suggested
alternative of defining a 'super-corpus' (super-collection). We are using
stock Lucene (not Solr or anything else). Is there a known method for
integrating the DF of multiple collections to produce such a cross-collection
DF?

I think I like the simplicity of the first method, but it remains to be seen if 
it would satisfy the relevancy needs of the application.

Thoughts?

-----Original Message-----
From: Jack Krupansky [mailto:j...@basetechnology.com] 
Sent: Saturday, September 06, 2014 12:10 PM
To: java-user@lucene.apache.org
Subject: Re: How to properly correlate relevance in a search across multiple 
collections

An observation: df (document frequency), and the IDF (inverse document
frequency) derived from it, is a key driver of the whole relevancy framework
on which stock Lucene is based. There is no question about its significant
value. But... that means that we can't blindly "correlate" relevancy between
"collections", in large part because the document scores are so heavily
driven by df, which is computed from the specific corpus of each collection.

My modest proposal: As valuable as df-based relevancy is, offer an easy to use 
"switch" to drop back to a pure tf-based relevancy score (primarily tf, but it 
can include other factors, but simply limited to the contents of the document 
itself) to sidestep these corpus-dependent scores. In other words, the score of 
the document could depend on only the contents of the document itself, not the 
corpus. Yes, that's a major loss of relevance, but the benefits for operations 
in a multi-corpus, distributed world can be substantial.

Yes, you can do this yourself by just plugging in your own custom "similarity"
class, but it should be offered as a much easier to use "switch" for Lucene 
itself (and Solr too!)
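
To make the pure-tf idea concrete, here is a minimal stand-alone sketch in
plain Java. It does not use the Lucene API. The class name TfOnlyScorer, the
sqrt(tf) weighting, and the 1/sqrt(length) normalization are assumptions
chosen for illustration (they resemble the classic TF-IDF formula with the
idf factor dropped), not the actual behavior of any Lucene similarity class.
The point is only that the score depends on the document alone.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: a score built only from the document's own
// contents (term frequency and document length), with no corpus statistics.
final class TfOnlyScorer {

    // Sum over query terms of sqrt(tf) * 1/sqrt(docLength).
    // No df/idf factor appears, so the same document gets the same score
    // no matter which collection it is stored in.
    static double score(List<String> queryTerms, List<String> docTokens) {
        if (docTokens.isEmpty()) {
            return 0.0;
        }
        Map<String, Integer> freq = new HashMap<>();
        for (String token : docTokens) {
            freq.merge(token, 1, Integer::sum);
        }
        double lengthNorm = 1.0 / Math.sqrt(docTokens.size());
        double sum = 0.0;
        for (String term : queryTerms) {
            sum += Math.sqrt(freq.getOrDefault(term, 0)) * lengthNorm;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("lucene", "index", "lucene", "search");
        System.out.println(score(Arrays.asList("lucene"), doc));
    }
}
```

Because nothing corpus-specific enters the formula, hits from separate
collections can be merged by score directly. The cost is that rare and common
query terms are weighted the same.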

The alternative is to have some mechanism to define and work with a 
"super-corpus" or "super-collection" that integrates the df for multiple 
corpuses, but... df is calculated or updated for the overall corpus, so a 
cross-corpus df would require recalculating df for all terms in the index 
whenever the multi-corpus structure changes, which can work in some cases, but 
not for things like distributed searches for Solr. That might be a superior 
solution, but might not be as easy or as performant as a simple non-df
similarity approach.
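
As a rough sketch of what "integrating the df" could mean, the plain-Java
class below sums per-collection document counts and per-term document
frequencies into one set of super-corpus statistics. It does not use the
Lucene API. The name SuperCorpusStats, the idf formula
1 + ln(docCount / (df + 1)), and the assumption that no document lives in
more than one collection are all mine, for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration: merge per-collection statistics into one
// "super-corpus" so that every collection is scored with the same df values.
// Assumes collections are disjoint, so document frequencies simply add up.
final class SuperCorpusStats {
    private long docCount = 0;
    private final Map<String, Long> docFreq = new HashMap<>();

    // Fold one collection's document count and per-term df into the totals.
    void addCollection(long collectionDocCount, Map<String, Long> collectionDocFreq) {
        docCount += collectionDocCount;
        for (Map.Entry<String, Long> e : collectionDocFreq.entrySet()) {
            docFreq.merge(e.getKey(), e.getValue(), Long::sum);
        }
    }

    long docCount() {
        return docCount;
    }

    long docFreq(String term) {
        return docFreq.getOrDefault(term, 0L);
    }

    // Assumed idf formula for this sketch: 1 + ln(docCount / (df + 1)).
    double idf(String term) {
        return 1.0 + Math.log((double) docCount / (docFreq(term) + 1));
    }

    public static void main(String[] args) {
        SuperCorpusStats stats = new SuperCorpusStats();
        Map<String, Long> a = new HashMap<>();
        a.put("lucene", 9L);
        Map<String, Long> b = new HashMap<>();
        b.put("lucene", 90L);
        stats.addCollection(100, a);
        stats.addCollection(900, b);
        System.out.println(stats.idf("lucene"));
    }
}
```

This also shows the maintenance cost described above: whenever a collection
is added, removed, or updated, the merged totals have to be rebuilt or
adjusted before scores are comparable again.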

It might also be nice for apps to offer users pure-tf scoring if it provides 
faster search results, and then the user could click on a "refine results" 
button to re-do the search with the more expensive cross-corpus df-based 
scoring.

Thoughts?

-- Jack Krupansky

-----Original Message-----
From: Baldwin, David
Sent: Friday, September 5, 2014 8:05 PM
To: java-user@lucene.apache.org
Subject: How to properly correlate relevance in a search across multiple 
collections

I have a project with multiple collections (there could be dozens at times).
A single result set needs to be generated by applying the same search
criteria to each collection directory and then merging all the sub-searches
into one result set with correlated relevance.

Does anyone have good experience with this, and could you share some tidbits
or info I may not have run across yet?

-David


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 

