Hi Russel, I am also interested in the internals of Lucene's ranking and how one can/should alter the scoring. For now I was just learning from existing code of Lucene scorers and Weights. Your question seemed interesting, so I in fact implemented a quick scorer that would return the raw tf as a score, as an exercise. It is not a product level implementation of course, but if you think this will help you (?) I can share the code. (Would have responded sooner, for my working computer went off for a few days with a fan error...:-).
Regards, Doron "Russell M. Allen" <[EMAIL PROTECTED]> wrote on 31/07/2006 07:35:50: > Thank you for the reply Doran! You are exactly right about the sql > count(*). I need the equivalent of group by, and count(). > > We have considered a 'joined' index where we would have a document for > each permutation. We discarded it (possibly prematurely) based on the > rapid explosion in the number of documents. In our domain, we have > movies as the main document type, and 5 other satellite document types > with their own indexes: Star, Studio, Director, Series, and Category > (genre). With the exception of series, a movie has a many to many > relationship with the other indexes. So, with 60k movies, 20k stars, 2k > studios, ... The document count quickly shoots through the roof. > > Also, the majority of our searching is based on a single domain type, > such as movie. It is only a small handful of corner cases where we want > what amounts to a joined query. If we merged these indexes, we would > constantly have to 'roll up' the results into distinct instances of a > type. (The equivalent of an SQL 'group by') > > > I find the parallels between the expressiveness of Lucene and SQL > interesting. I'm glad to see you compared what I was looking for to an > sql count(*) as well. We have a handful of indexing issues that I am > attempting to solve/optimize, of which performing a count(*) is only > one. I also have the need to perform a JOIN across two indexes. I have > 'ideas' about how I might go about this, but for now we are fortunate > enough to have fairly static data and half of the join is static. As a > result we can cache a bitset filter for the results of half the join and > apply it to the other (dynamic) half of the join query. > > Anyway, I digress... > > I saw you second post regarding creating a scorer. I'd like to continue > down that path. My main issue now is simply understanding how lucene > works under the covers enough to write the TermQuery variant. > > Thanks for the help, > Russell. > --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]