markharw00d <[EMAIL PROTECTED]> wrote on 26/09/2006 00:11:12:

> If you were to score repeated terms then I suspect it would have to be
> done so that the repetitions didn't score as highly as the first
> occurrence - otherwise f2 could be selected as a better fragment than f3
> for the query q1 in your example.
> Repetitions of a term in a fragment could be scored as a very small
> fraction of the score given to the first occurrence. This would at least
> rank f2 higher than f1 for query q2.
> Another potentially useful ranking factor may be to boost fragments
> found at the beginning of a document - that's where people tend to write
> summaries or introductions.
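To check my reading of the repetition idea, here is a rough sketch of it run against the f1/f2/f3 example quoted further down. It does not use the Highlighter API at all: the class name, the score method and the 0.01 repeat weight are my own assumptions, chosen only so that a repeat can never outweigh a first occurrence. The position boost is not covered.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not part of the Lucene Highlighter.
final class DampedFragmentScorer {

    // Assumed weight for each repeat of a query term already seen in the fragment.
    static final double REPEAT_WEIGHT = 0.01;

    // The first occurrence of a query term scores 1.0, each repeat a small fraction.
    static double score(List<String> fragment, Set<String> queryTerms) {
        Set<String> seen = new HashSet<>();
        double score = 0.0;
        for (String token : fragment) {
            if (!queryTerms.contains(token)) {
                continue; // not a query term, contributes nothing
            }
            // Set.add returns true only the first time the token is added.
            score += seen.add(token) ? 1.0 : REPEAT_WEIGHT;
        }
        return score;
    }

    public static void main(String[] args) {
        List<String> f1 = Arrays.asList("aa", "11", "bb", "33", "cc");
        List<String> f2 = Arrays.asList("aa", "11", "bb", "11", "cc");
        List<String> f3 = Arrays.asList("aa", "11", "bb", "22", "cc");
        Set<String> q1 = new HashSet<>(Arrays.asList("11", "22"));
        Set<String> q2 = new HashSet<>(Arrays.asList("11"));

        // q1: f3 still ranks first because it matches two distinct terms.
        System.out.println("q1: " + score(f1, q1) + " " + score(f2, q1) + " " + score(f3, q1));
        // q2: f2 now ranks slightly above f1 and f3.
        System.out.println("q2: " + score(f1, q2) + " " + score(f2, q2) + " " + score(f3, q2));
    }
}
```

With any repeat weight well below 1, f3 stays the best fragment for q1 while f2 edges ahead of f1 for q2, which is the ordering described in the quoted text.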
Yes, it makes sense to add these heuristics.

I was somewhat surprised to find that highlighting scoring simply counts
how many unique query terms appear in the fragment. I guess I was expecting
a more similarity-like ranking of fragments - something with tf based on
the frequency of a term in the fragment, and idf based on the frequency of
the term in the entire text. Idf would be meaningless for a single term
query, since it would scale every fragment's score by the same factor.
Possibly, idf could relate to "iff" ~ inverse number of fragments
containing the term. I am not sure if this is worth the effort, but it
seems more correct...?

Another thing I saw is that Highlighter seems to break the text arbitrarily
by max-fragment-size, so for the text:
  1 2 x 4 a b x d y B C D
if it happens to be broken into fragments of 4 tokens, for the query "x y"
the result would be:
  1 2 x 4 - score 1
  a b x d - score 1
  y B C D - score 1
and the first fragment would be selected as 'best', although the fragment
"x d y B", which also appears in that text, is better because it contains
both query terms.
Again, I am not sure if this is worth the effort - allowing overlap between
candidate fragments - just something to think about.

> Doron Cohen wrote:
> > This question was raised in the user's list -
> > http://www.nabble.com/highlighting-tf2322109.html
> >
> > Assume three fragments and two queries:
> >   f1 = aa 11 bb 33 cc
> >   f2 = aa 11 bb 11 cc
> >   f3 = aa 11 bb 22 cc
> >   q1 = 11 22
> >   q2 = 11
> > Now we call highlighter.getBestFragment(q);
> > For q1, f3 is returned, as expected.
> > For q2, f1 is returned, although "11" appears twice in f2 but only once in
> > f1.
> >
> > This is because QueryScorer.getTokenScore(Token) counts only unique
> > fragment tokens.
> >
> > Would it make sense to make this behavior controllable?
> > (It is easily done but I am not sure about the consequences.)
> >
> > Or perhaps there is a way to achieve this behavior (preferring f2 on f1 for
> > q2 above) that I missed?
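PS. To make the overlapping-fragments point concrete, here is a small standalone sketch using the 12-token text from my example. It is not Highlighter code: the class and method names are made up for illustration, and it only mimics the "count unique query terms" scoring. A step equal to the fragment size reproduces the fixed split; a step of 1 tries every offset.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; none of these names exist in the Highlighter.
final class SlidingFragmentSketch {

    // Number of distinct query terms in tokens[start, start + size), clipped at the end.
    static int uniqueHits(List<String> tokens, int start, int size, Set<String> query) {
        Set<String> seen = new HashSet<>();
        int end = Math.min(tokens.size(), start + size);
        for (String token : tokens.subList(start, end)) {
            if (query.contains(token)) {
                seen.add(token);
            }
        }
        return seen.size();
    }

    // Start offset of the best-scoring window when windows advance by `step` tokens.
    static int bestStart(List<String> tokens, int size, int step, Set<String> query) {
        int best = 0;
        int bestScore = -1;
        for (int start = 0; start < tokens.size(); start += step) {
            int score = uniqueHits(tokens, start, size, query);
            if (score > bestScore) { // strict comparison: ties keep the earliest window
                bestScore = score;
                best = start;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> text = Arrays.asList("1 2 x 4 a b x d y B C D".split(" "));
        Set<String> query = new HashSet<>(Arrays.asList("x", "y"));

        int fixed = bestStart(text, 4, 4, query);   // non-overlapping fragments
        int sliding = bestStart(text, 4, 1, query); // every possible offset

        System.out.println("fixed:   " + text.subList(fixed, Math.min(text.size(), fixed + 4))
                + " score " + uniqueHits(text, fixed, 4, query));
        System.out.println("sliding: " + text.subList(sliding, Math.min(text.size(), sliding + 4))
                + " score " + uniqueHits(text, sliding, 4, query));
    }
}
```

With the fixed split all three fragments score 1 and the first one wins. With a step of 1 the winner is a window containing both x and y: "b x d y" and "x d y B" both score 2, and the strict comparison keeps the earlier of the two. The obvious cost is that the number of candidate fragments grows by roughly the fragment size.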
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]