Help: tweaking search - reducing IDF skew and implementing score cutoff

Chun Wei Ho Thu, 09 Feb 2006 23:26:15 -0800

Hi,

I am running a search for something akin to a news site, when each
news document has a date, title, keywords/bylines, summary fields and
then the actual content. Using Lucene for this database of documents,
it seems that:


1. The relevancy score is skewed drastically by the actual number of
news document per day. For example if I RangedQuery on this week, and
there were 100 news on Monday and two news on Sunday, the two on
Sunday gets ranked highly due to idf. How do I reduce this skewness
due to the date-posted field? I saw a reference earlier to
ConstantScoreRangeQuery on JIRA - is it the solution?

2. If I choose to sort the results by date, then recent documents with
very very low relevancy (say the words searched appears only in
content, and not in title/bylines/summary fields that are boosted
higher) are still shown relatively high in the list, and I wish to
omit them in general. What is the best way to implement some sort of a
relevancy filter (include only documents with an normalized score of
0.2 or more....)? Or is there a better way around it?

Thanks :)

Best Regards,
CW

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Help: tweaking search - reducing IDF skew and implementing score cutoff

Reply via email to