Re: Beagle Scoring System

2005-12-17 Thread Debajyoti Bera
 I have noticed that mail messages seem to get unusually high scores
 from the indexer. While holmes makes the problem much less of an issue
 (since it separates the conversation results), it still seems like
 something worth fixing. I can't seem to figure out exactly why the
 scoring is so off, but an initial guess would be the ease with which
 we can add hotwords for email (subject lines) as opposed to most other
 backends.

(from http://wiki.apache.org/jakarta-lucene/LuceneFAQ )
Lucene automatically adds a weight inversely proportional to the length of the
field, i.e. terms in short fields (like sender name, email address, subject)
will get a higher weight (known as 'boost') than terms in text. The same holds
for document metadata: it gets more weight than document data/text.
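
To make the length effect concrete, here is a minimal sketch in Python. It
assumes the classic Lucene default of 1/sqrt(number of terms in the field) for
length normalization; the function name and the field sizes are made up for
illustration and are not Beagle code.

```python
import math

def length_norm(num_terms: int) -> float:
    """Length normalization factor: shorter fields get a larger weight.

    Assumption: mirrors the classic Lucene default, 1 / sqrt(num_terms).
    """
    return 1.0 / math.sqrt(num_terms)

# Hypothetical field sizes, chosen only for illustration.
subject_norm = length_norm(4)    # a 4-term subject line
body_norm = length_norm(400)     # a 400-term message body

print(subject_norm, body_norm)
```

Under that assumption, a match in the short subject line is weighted roughly
ten times more heavily than the same match in the long body, which would fit
the high scores seen for mail.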

(from my understanding)
Beagle searches several Lucene indexes and merges the results based on their
scores. Somewhere during the process, it recalculates the score based on the
age of the document. However, the absolute values of Lucene scores are not
directly comparable; only the ratios between scores from the same index (and
hence the ranking) are comparable. In that sense, I don't think scores across
multiple indexes should be directly compared. Ranking within a particular
backend is meaningful and, IMO, that is the correct way to do it.
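
A rough sketch of ranking per backend instead of merging on raw scores, in
Python. The hit tuples, backend names, and function name are all hypothetical;
this is not how Beagle currently merges results, only an illustration of the
idea.

```python
from collections import defaultdict

def rank_within_backends(hits):
    """Group (backend, uri, score) hits by backend and order each group by score.

    Scores are only compared with other scores from the same backend.
    """
    grouped = defaultdict(list)
    for backend, uri, score in hits:
        grouped[backend].append((uri, score))
    return {
        backend: [uri for uri, _ in sorted(items, key=lambda h: h[1], reverse=True)]
        for backend, items in grouped.items()
    }

# Made-up hits: the mail scores are higher overall, but that no longer matters.
hits = [
    ("mail", "email://1", 0.92),
    ("files", "file:///a.txt", 0.31),
    ("mail", "email://2", 0.80),
    ("files", "file:///b.txt", 0.45),
]
ranked = rank_within_backends(hits)
print(ranked)
```

Each backend keeps its own ordering, so a uniformly inflated score range in one
index cannot push another backend's results down.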

- dBera
___
Dashboard-hackers mailing list
Dashboard-hackers@gnome.org
http://mail.gnome.org/mailman/listinfo/dashboard-hackers


Re: Beagle Scoring System

2005-12-17 Thread Kevin Kubasik
Unfortunately, that leaves us holding the bag on how to fix it... and
I am at a loss for anything short of some hard-coded reduction
ratio/factor for all mail scores.
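
For reference, the hard-coded factor idea would amount to something like the
following Python sketch. The backend names and the 0.5 value are placeholders
picked for illustration, not measured or proposed values.

```python
# Hypothetical per-backend damping factors; the value is made up.
BACKEND_FACTOR = {"mail": 0.5}

def adjusted_score(backend: str, raw_score: float) -> float:
    """Scale a raw score by its backend's factor (1.0 if none is configured)."""
    return raw_score * BACKEND_FACTOR.get(backend, 1.0)

print(adjusted_score("mail", 0.8), adjusted_score("files", 0.3))
```

The obvious drawback is that any such constant has to be tuned by hand and
would need revisiting whenever the indexed fields change.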

Perhaps it's just something we have to handle in the front ends. Mail
and Chats are stored in separate indexes; maybe we should just stick
to that for the front ends as well...

-Kevin Kubasik

--
Kevin Kubasik
240-838-6616