Calculation of fieldNorm causes irritating effect of sort order

Jimi Hullegård Thu, 02 Oct 2008 04:39:13 -0700

Hi,

Maybe I have misunderstood the general concept of how search results should be 
scored with regard to the fieldNorm, but the way I see it, it causes an 
irritating effect on the sort order for me.


Here's the deal:

I'm building a simple site with documents that represent ideas. Each idea can 
be active or inactive. Our search page has a simple text field for search text 
input. Other than that, the only thing the user can influence is whether to 
search all ideas or only active ones. The problem is that even if the search 
over all ideas returned only active ideas, the sort order can change when the 
user repeats the same search restricted to active ideas.

Example:

A search for "betyg", where the user doesn't care if the ideas are active or 
inactive, gives this result:

document-153
document-244

The user then checks the checkbox "Only active ideas" and clicks the search 
button again. Now the result is:

document-244
document-153

When I turned on debug mode for the Lucene part of the third-party CMS, I saw 
the queries that Lucene received:

The first query:
+type:idea +alltext:betyg

The second query:
+type:idea +(+alltext:betyg +category:14)

(The category 14 represents the status Active.)


I started Luke and ran the same searches there, with the same result (the sort 
order of the first search was the reverse of the sort order of the second). I 
then clicked the "Explain" button for each document. There I found that all 
nodes had the same value for both documents, except for the last one, the 
fieldNorm for the field category.

I then did a quick Google search for fieldNorm and found this:

http://www.mail-archive.com/[EMAIL PROTECTED]/msg06275.html

So the fieldNorm is the product of the field boost for the document and the 
lengthNorm for the field in the document. I am pretty sure that the boost is 
the same for both documents, so that leaves only the lengthNorm. According to 
the javadoc for the Similarity class, the lengthNorm value depends on the 
number of tokens in the field for the particular document. Now the strange 
behavior makes sense, because document 153 has a total of 6 different tokens 
in the category field, and document 244 has only 5. A field with fewer tokens 
gets a higher lengthNorm, so document 244 scores slightly higher as soon as 
the category clause is part of the query. But in this case, this behavior is 
not really what I want. Do you have any suggestions on how to solve this? Is 
it possible to disable the lengthNorm calculation for particular fields?
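
To make the arithmetic concrete, here is a minimal sketch in plain Java. It 
assumes the default formula is lengthNorm = 1 / sqrt(number of tokens) and 
that both documents have a field boost of 1.0. The class and method names are 
my own and are not Lucene API, and the sketch ignores the lossy one-byte 
encoding that Lucene applies to stored norms.

```java
// Sketch only: models the assumed default length normalisation in plain Java.
// It does not call Lucene and skips the one-byte encoding of stored norms.
final class FieldNormSketch {

    // Assumed default formula: lengthNorm = 1 / sqrt(number of tokens in the field).
    static float lengthNorm(int numTokens) {
        return (float) (1.0 / Math.sqrt(numTokens));
    }

    // fieldNorm = field boost * lengthNorm, as described above.
    static float fieldNorm(float fieldBoost, int numTokens) {
        return fieldBoost * lengthNorm(numTokens);
    }

    public static void main(String[] args) {
        float boost = 1.0f; // assumption: both documents use the same boost

        float norm153 = fieldNorm(boost, 6); // document-153: 6 category tokens
        float norm244 = fieldNorm(boost, 5); // document-244: 5 category tokens

        System.out.println("document-153 category fieldNorm: " + norm153);
        System.out.println("document-244 category fieldNorm: " + norm244);
        System.out.println("document-244 scores higher on category: "
                + (norm244 > norm153));
    }
}
```

Under these assumptions the two norms come out at roughly 0.41 and 0.45, which 
would be enough to flip the order when everything else in the score is equal.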

Regards
/Jimi

mogul | jimi hullegård | system developer | hudiksvallsgatan 4, 113 30 
stockholm sweden | +46 8 506 66 172 | +46 765 27 19 55 | [EMAIL PROTECTED] | 
www.mogul.com

