Thanks for sharing, Walter!  I hope someone enterprising tackles it.
It'd be nice to have global IDF by default without having to go enable
something that adds a performance risk.

I'm sure you have many career stories to tell.  If you find yourself
at Acadia National Park hiking & backpacking, as you like to do, shoot
me a message. :-D

~ David

On Tue, Aug 27, 2024 at 3:01 PM Walter Underwood <wun...@wunderwood.org> wrote:
>
> When I’ve enabled global exact IDF in Solr, the speed penalty was about 10X. 
> Back in 1995, Infoseek figured out how to do that with no speed penalty. They 
> patented it, but that patent expired several years ago. I’ll try and hunt it 
> down.
>
> Short version: from each shard, return the number of docs and the df for each
> term. When combining results, sum the DF across shards, sum the NUMDOCS across
> shards, divide the summed NUMDOCS by the summed DF, and you have the global
> IDF. This is constant for the whole result list. Each shard already needs that
> info for its local score, so it shouldn’t be extra work.
>
> When does this matter? When the relevant documents for a term are mostly on 
> one shard, either intentionally or accidentally. Let’s say we have a news 
> search and all the stories for August 2024 are on one shard. The term 
> “kamala” will be much more common on that shard, giving a lower IDF, but…the 
> relevant documents are probably on that shard. So the best documents have a 
> lower score using local IDF.
>
> This also shows up with lots of shards or small shards, because there will be 
> uneven distribution of docs. When I retired from LexisNexis, we had a cluster 
> with 320 shards. I’m sure that had some interesting IDF behavior.
>
> I wrote up how we did this in a Java distributed search layer for Ultraseek: 
> https://observer.wunderwood.org/2007/04/04/progressive-reranking/
>
> There is some earlier discussion here: 
> https://solr-user.lucene.apache.narkive.com/zNa1Hn4p/single-call-for-distributed-idf
>
> I don’t think there is a Jira issue for this.
>
> I think that is all the unfinished business since putting Solr 1.3 into 
> production at Netflix. Pretty darned good job everybody. Huge thanks to all 
> the contributors and committers who have put in years of effort over that 
> time.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
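
A minimal sketch of the merge step Walter describes above, in Python. The names ShardStats, global_idf, and local_idf are hypothetical and are not Solr or Lucene APIs. The plain log(N/df) formula is also an assumption made for illustration; a real similarity such as BM25 applies its own smoothing. The shard counts are invented to mirror the "kamala" news example.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ShardStats:
    """Per-shard statistics returned alongside results (hypothetical shape)."""
    num_docs: int
    doc_freq: dict = field(default_factory=dict)  # term -> df on this shard


def local_idf(shard, term):
    """IDF computed from one shard's own counts (assumed log(N/df) form)."""
    df = shard.doc_freq.get(term, 0)
    return math.log(shard.num_docs / df) if df else 0.0


def global_idf(shards, terms):
    """Sum NUMDOCS and DF over all shards, then divide, once per term."""
    total_docs = sum(s.num_docs for s in shards)
    result = {}
    for term in terms:
        total_df = sum(s.doc_freq.get(term, 0) for s in shards)
        result[term] = math.log(total_docs / total_df) if total_df else 0.0
    return result


# Invented numbers: one shard holds August 2024 news, two hold older stories.
august = ShardStats(num_docs=1000, doc_freq={"kamala": 200})
older_a = ShardStats(num_docs=1000, doc_freq={"kamala": 5})
older_b = ShardStats(num_docs=1000, doc_freq={"kamala": 5})

shards = [august, older_a, older_b]
merged = global_idf(shards, ["kamala"])

print("local IDF, August shard:", local_idf(august, "kamala"))
print("local IDF, older shard: ", local_idf(older_a, "kamala"))
print("global IDF:             ", merged["kamala"])
```

With local IDF, the August shard gives "kamala" the lowest weight even though it holds most of the relevant stories. The merged value is the same for every shard, so scores become comparable across the whole result list.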
