Thanks for sharing, Walter! I hope someone enterprising tackles it. It'd be nice to have global IDF by default without having to enable something that adds a performance risk.
I'm sure you have many career stories to tell. If you find yourself at Acadia National Park hiking & backpacking, as you like to do, shoot me a message. :-D

~ David

On Tue, Aug 27, 2024 at 3:01 PM Walter Underwood <wun...@wunderwood.org> wrote:
>
> When I’ve enabled global exact IDF in Solr, the speed penalty was about 10X.
> Back in 1995, Infoseek figured out how to do that with no speed penalty. They
> patented it, but that patent expired several years ago. I’ll try and hunt it
> down.
>
> Short version, from each shard return the number of docs and the df for each
> term. When combining results, add all the DF, add all the NUMDOCS, divide,
> and you have the global IDF. This is constant for the whole result list. Each
> shard already needs that info for local score, so it shouldn’t be extra work.
>
> When does this matter? When the relevant documents for a term are mostly on
> one shard, either intentionally or accidentally. Let’s say we have a news
> search and all the stories for August 2024 are on one shard. The term
> “kamala” will be much more common on that shard, giving a lower IDF, but…the
> relevant documents are probably on that shard. So the best documents have a
> lower score using local IDF.
>
> This also shows up with lots of shards or small shards, because there will be
> uneven distribution of docs. When I retired from LexisNexis, we had a cluster
> with 320 shards. I’m sure that had some interesting IDF behavior.
>
> I wrote up how we did this in a Java distributed search layer for Ultraseek:
> https://observer.wunderwood.org/2007/04/04/progressive-reranking/
>
> There is some earlier discussion here:
> https://solr-user.lucene.apache.narkive.com/zNa1Hn4p/single-call-for-distributed-idf
>
> I don’t think there is a Jira issue for this.
>
> I think that is all the unfinished business since putting Solr 1.3 into
> production at Netflix. Pretty darned good job everybody. Huge thanks to all
> the contributors and committers who have put in years of effort over that
> time.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org
For additional commands, e-mail: dev-h...@solr.apache.org
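
For anyone skimming the thread, here is a minimal sketch of the "short version" quoted above, written in Python rather than Solr's Java just to make the arithmetic concrete. The names ShardStats, local_idf and global_idf are hypothetical, the shard counts are made up, and the log(N/df) form is an assumption: the quoted message only says to add the DFs, add the NUMDOCS, and divide. It is not how Solr or Infoseek actually implemented it.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ShardStats:
    """Per-shard statistics a shard could return with its results (hypothetical shape)."""
    num_docs: int
    doc_freq: Dict[str, int]  # term -> number of docs on this shard containing it


def _idf(num_docs: int, df: int) -> float:
    # Assumed IDF form: log(N / df). Returns 0.0 when the term is absent,
    # which is a simplification chosen for this sketch.
    if df <= 0 or num_docs <= 0:
        return 0.0
    return math.log(num_docs / df)


def local_idf(shard: ShardStats, term: str) -> float:
    """IDF computed from one shard's statistics only."""
    return _idf(shard.num_docs, shard.doc_freq.get(term, 0))


def global_idf(shards: List[ShardStats], term: str) -> float:
    """Add all the DF, add all the NUMDOCS, divide: one value for the whole result list."""
    total_docs = sum(s.num_docs for s in shards)
    total_df = sum(s.doc_freq.get(term, 0) for s in shards)
    return _idf(total_docs, total_df)


# Made-up numbers for the news-search example: one shard holds the
# August 2024 stories, so "kamala" is far more common there.
august = ShardStats(num_docs=1000, doc_freq={"kamala": 200})
others = [ShardStats(num_docs=1000, doc_freq={"kamala": 10}) for _ in range(3)]
all_shards = [august] + others

if __name__ == "__main__":
    print("local IDF, August shard:", local_idf(august, "kamala"))
    print("local IDF, other shard: ", local_idf(others[0], "kamala"))
    print("global IDF:             ", global_idf(all_shards, "kamala"))
```

With these numbers the August shard's local IDF comes out lower than the other shards' local IDF, which is the skew described in the quoted message; the global value sits between the two and is the same for every shard.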