Hello,
I have a question about idf computation for different fields:

As we know, idf = Math.log(numDocs/(double)(docFreq+1)) + 1.0
docFreq is field-specific, but numDocs is a single number shared by all
fields.

For example:
Assume there are 1M docs, i.e. numDocs=10^6.
All of the docs have field_1, but only 10,000 have a non-empty field_2. So
for a given word, maybe docFreq(field_1)=1000 while docFreq(field_2)=10, and
then the idf for field_2 will be much higher than for field_1:
idf(field_1) = ln(10^6/(1000+1))+1
idf(field_2) = ln(10^6/(10+1))+1
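
To put numbers on the two expressions above, here is a minimal sketch. It
only mirrors the formula as quoted in this mail; the class and method names
are made up for illustration and are not Lucene's own code. With these
inputs the two values come out at roughly 7.91 and 12.42.

```java
// Illustrative only: IdfExample and idf() are hypothetical names that
// reproduce the formula quoted in this mail, not a Lucene class.
class IdfExample {
    // idf = ln(numDocs / (docFreq + 1)) + 1, using floating-point division
    static double idf(long docFreq, long numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        long numDocs = 1_000_000L; // shared by all fields
        System.out.println("idf(field_1) = " + idf(1000, numDocs));
        System.out.println("idf(field_2) = " + idf(10, numDocs));
    }
}
```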

Then if I use a DisjunctionMaxQuery over field_1 and field_2, the scores are
unfair: a match in field_2 wins mostly because the field is sparse. And
setBoost is not an ideal way to adjust for this.
Any suggestions on this?

Actually, I think the result is somewhat unreasonable. Do you think it is
worth a JIRA ticket to replace the total numDocs in the idf expression with
the number of docs that have a non-empty value for the specific field?
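
A rough sketch of what I mean, under the assumption that a per-field count
of non-empty docs is available. The parameter docsWithField and the class
name are hypothetical, chosen only for this illustration. With the numbers
from my example, field_1 stays at about 7.91 and field_2 drops to about
7.81, so the two fields become comparable.

```java
// Illustrative only: PerFieldIdfExample and docsWithField are hypothetical,
// not an existing Lucene API.
class PerFieldIdfExample {
    // Same formula as before, but the numerator is the number of docs that
    // actually have a value in this field instead of the global numDocs.
    static double idf(long docFreq, long docsWithField) {
        return Math.log(docsWithField / (double) (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        // field_1: present in all 1M docs, docFreq = 1000
        System.out.println("idf(field_1) = " + idf(1000, 1_000_000L));
        // field_2: present in only 10,000 docs, docFreq = 10
        System.out.println("idf(field_2) = " + idf(10, 10_000L));
    }
}
```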

Thanks!

-- 
Regards,
    Boyan
