Mark,

On Thursday 23 December 2004 14:25, mark harwood wrote:
> Another thought on fuzzy scoring:
> shouldn't all these queries which automatically expand
> terms favour common words over rare ones? The default
> scoring behaviour at the moment favours rare words. As
> a user aren't I more likely to be looking for the most
> common expansions?
>
> If I'm not sure how to spell I might search for:
>     accomodation~
> or
>     accom*
> The fuzzy scoring algorithms will currently favour all
> of the mis-spellings of accommodation in the ranking
> of results because they are more rare.
>
> Ideally within the expansions of a term the score
> contribution should be based on df (as opposed to the
> usual idf) BUT within the overall query the usual idf
> scheme applies. To clarify:
> If I search for:
>     the cheapest accomodation~ in london
> I want to see the most common spellings of
> accommodation before all other variants of this word
> BUT I then want these variants scored against the
> OTHER words ("in", "the" etc) on the usual basis of
> rarity.
>
> This suggests a sort order within another, different
> sort order.
> This seems like it would not be easy to do. Any bright
> ideas?

The brightest idea I have had so far is to drop the idf altogether.
Idf just doesn't seem to make much sense for terms that are related
through expansion, whether as fuzzy terms or as truncated terms.

But since dropping idf is probably too controversial, one solution
that keeps idf is to use the minimum idf over all the expanded terms.
Also, the within-document frequencies of the expanded terms could be
summed over these terms before applying tf(), without a coordination
factor as you suggested in the previous post.

These three measures together would effectively treat each expanded
term as having equal value for scoring. This would score the most
common spellings equal to the less common ones.

Regards,
Paul Elschot
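
P.S. A rough sketch of the three measures above, in Python rather than
against the real Lucene classes. The function names and the example
numbers are made up for illustration. The idf() and tf() formulas are
assumed to be roughly what the default similarity does
(1 + ln(numDocs / (docFreq + 1)) and sqrt(freq)), so treat this as a
sketch of the idea and not as a patch.

```python
import math


def idf(doc_freq, num_docs):
    # Assumed default-style idf: rarer terms get a larger value.
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def tf(freq):
    # Assumed default-style tf: square root of the frequency.
    return math.sqrt(freq)


def expanded_term_score(freqs_in_doc, doc_freqs, num_docs):
    """Score one document for one expanded (fuzzy or truncated) term.

    freqs_in_doc: {variant: frequency of that variant in this document}
    doc_freqs:    {variant: document frequency} for every expanded variant
    """
    # Measure 1: one shared idf, the minimum over all expanded variants
    # (i.e. the idf of the most common variant).
    shared_idf = min(idf(df, num_docs) for df in doc_freqs.values())
    # Measure 2: add up the within-document frequencies of all variants
    # before applying tf().
    summed_freq = sum(freqs_in_doc.get(v, 0) for v in doc_freqs)
    # Measure 3: no coordination factor over the variants; the group is
    # scored as if it were a single term.
    return tf(summed_freq) * shared_idf


# Hypothetical index: 1000 documents, correct spelling is far more common.
doc_freqs = {"accommodation": 900, "accomodation": 10}
common = expanded_term_score({"accommodation": 1}, doc_freqs, 1000)
rare = expanded_term_score({"accomodation": 1}, doc_freqs, 1000)
print(common == rare)
```

With these measures a document containing only the rare misspelling
scores the same as one containing only the common spelling, which is
the "equal value" behaviour described above. It does not yet rank the
common spellings first, as Mark asked for.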