Grant writes:
One question that comes to mind, is what are you looking to do?
What I'm trying to do is prevent Lucene from ranking documents that repeat a term many times above documents that match more distinct terms. I've got some huge queries with quite a number of unique terms. I want the documents that hit more unique terms to float to the top, while documents that hit only some or a few of the terms sink to the bottom (even if they have more occurrences of those terms).

Lucene, as I understand things, does this for the most part, though it is possible for term frequency to play a significant role and drown out the part of the behavior that I'd like to keep. My users want to grab search results using the current Lucene method, then select a checkbox and search without term frequency contributing to the score. I, on the other hand, have a vested interest in not maintaining two indexes.

Daniel wrote:
I don't want the title "Caesar Came, Caesar Saw, Caesar Conquered" to have a higher score just because the word Caesar is repeated.
This is very much the kind of problem I'm trying to address.
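To make the effect concrete, here is a minimal sketch in plain Java. It is not Lucene code: the class and method names are made up for illustration, the per-term counts are invented, and idf, norms, coord and boosts are deliberately left out. It assumes the default tf() is the square root of freq, which is how I read DefaultSimilarity, and compares that with a flat tf that counts a term once no matter how often it repeats. In Lucene itself the corresponding hook would presumably be tf() in a Similarity subclass.

```java
// Illustrative model of the tf part of scoring only; hypothetical names, not Lucene's API.
final class FlatTfSketch {

    /** Square-root damping: assumed here to match the default tf() behavior. */
    static double sqrtTf(int freq) {
        return Math.sqrt(freq);
    }

    /** Flat tf: a matching term contributes 1 regardless of how often it repeats. */
    static double flatTf(int freq) {
        return freq > 0 ? 1.0 : 0.0;
    }

    /** Sums the tf contribution of each query term (idf, norms, boosts omitted). */
    static double score(int[] freqs, boolean flat) {
        double sum = 0.0;
        for (int freq : freqs) {
            sum += flat ? flatTf(freq) : sqrtTf(freq);
        }
        return sum;
    }

    public static void main(String[] args) {
        // Invented per-term frequencies for a three-term query.
        int[] repeatsOneTerm = {16, 0, 0}; // one term matched, sixteen occurrences
        int[] hitsAllTerms = {1, 1, 1};    // three distinct terms matched, once each

        System.out.println("sqrt tf: " + score(repeatsOneTerm, false)
                + " vs " + score(hitsAllTerms, false));
        System.out.println("flat tf: " + score(repeatsOneTerm, true)
                + " vs " + score(hitsAllTerms, true));
    }
}
```

With these made-up counts the square-root version puts the repetitive document ahead (4.0 vs 3.0), while the flat version puts the document that hits all three terms ahead (1.0 vs 3.0). Since tf() is applied to the stored freq at search time, the hope is that this can be switched per search without a second index.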
I'm currently looking at whether overriding queryNorm() to always return 1 is a good thing or not.
I vaguely recall reading something about boosts as well; I'm not sure you want to mess with this one. For me, idf() is the bigger question.

Yonik writes:
For a single term, [freq is] determined at index time and stored in the index.
I guess what I'm asking is: is freq, the value passed to tf(), the raw count of the term, or a ratio of the term's count to the total number of terms in the index?
Scores across queries aren't really comparable though. ... Lucene doesn't currently do this "mixing", ...
This has always been my understanding of search engines: the raw scores are effectively meaningless if used to mix the results of different queries together. However, according to the documentation at http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/search/Similarity.html, it would seem that #4, queryNorm(), says:

"This factor does not affect document ranking (since all ranked documents are multiplied by the same factor), but rather just attempts to make scores from different queries (or even different indexes) comparable."

This to me is black magic. It implies that one can run two different queries and merge-sort the results, and further that the content can come from different indexes. Either I'm reading this completely wrong, or the documentation may need an update.

Hope this provides better clarification. As for why I am asking all this: what I'd like to do is get a firm understanding of how Lucene does scoring, so that the behavior can be modified, and, if I'm ambitious enough, fill in some of the holes in the documentation.

Thanks all,
-wls
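
P.S. For anyone following along, here is the arithmetic behind the first half of that javadoc sentence as I read it, in plain Java rather than Lucene code. The class and method names are hypothetical, the weights and raw scores are invented, and the norm is assumed to be 1 / sqrt(sum of squared term weights), which is my understanding of the default queryNorm(). It only shows that one constant factor per query cannot reorder that query's results; it says nothing about whether scores from two different queries become comparable.

```java
import java.util.Arrays;

// Illustrative only; hypothetical names, not Lucene's API.
final class QueryNormSketch {

    /** Assumed form of the default norm: 1 / sqrt(sum of squared term weights). */
    static double queryNorm(double[] termWeights) {
        double sumOfSquares = 0.0;
        for (double w : termWeights) {
            sumOfSquares += w * w;
        }
        return 1.0 / Math.sqrt(sumOfSquares);
    }

    /** Multiplies every score by the same factor. */
    static double[] scale(double[] scores, double factor) {
        double[] out = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            out[i] = scores[i] * factor;
        }
        return out;
    }

    /** Returns document positions ordered by descending score. */
    static Integer[] ranking(double[] scores) {
        Integer[] order = new Integer[scores.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> Double.compare(scores[b], scores[a]));
        return order;
    }

    public static void main(String[] args) {
        double[] termWeights = {3.0, 4.0};   // invented query term weights
        double[] rawScores = {2.5, 7.0, 4.0}; // invented raw scores for three documents

        double norm = queryNorm(termWeights);
        double[] normalized = scale(rawScores, norm);

        System.out.println("ranking before norm: " + Arrays.toString(ranking(rawScores)));
        System.out.println("ranking after norm:  " + Arrays.toString(ranking(normalized)));
    }
}
```

Both lines print the same order, which is why returning 1 from queryNorm() should leave the ranking within a single query untouched.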