On Wed, Apr 13, 2011 at 01:01:09AM +0400, Earwin Burrfoot wrote:
> Excuse me for somewhat of an offtopic, but have anybody ever seen/used -subj-
> ?
> Something that looks like like http://dl.dropbox.com/u/920413/IDFplusplus.png
> Traditional log(N/x) tail, but when nearing zero freq, instead of
> going to +inf you do a nice round bump (with controlled
> height/location/sharpness) and drop down to -inf (or zero).

I haven't used that technique, nor can I quote academic literature blessing
it.  Nevertheless, what you're doing makes sense to me.

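One way to get a curve of that shape, sketched in Python.  This is an
illustrative assumption, not the function behind the linked plot: the name
damped_idf and the parameters pivot and sharpness are hypothetical, and the
damping factor is just one convenient choice.  It keeps the log(N/x) tail for
common terms and pulls the weight down to zero as the document frequency
approaches zero.

```python
import math


def damped_idf(doc_freq, num_docs, pivot=5.0, sharpness=2.0):
    """Plain log(N/df), scaled by a factor that goes to 0 as df -> 0.

    pivot     -- hypothetical knob: roughly where the bump sits (in docs)
    sharpness -- hypothetical knob: how abruptly the weight falls off below it
    """
    if doc_freq <= 0:
        return 0.0  # term absent from the collection: no weight at all
    plain_idf = math.log(num_docs / doc_freq)
    # Close to 0 for doc_freq << pivot, close to 1 for doc_freq >> pivot.
    damping = 1.0 - math.exp(-((doc_freq / pivot) ** sharpness))
    return plain_idf * damping


if __name__ == "__main__":
    for df in (1, 2, 5, 10, 100, 10000):
        print(df, round(damped_idf(df, 1_000_000), 3))
```

With these defaults a term seen in a single document scores far below one
seen in ten documents, while terms well above the pivot get the usual
log(N/x) weight.  The right pivot and sharpness depend on the collection.
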
> Rationale is that - most good, discriminating terms are found in at
> least a certain percentage of your documents, but there are lots of
> mostly unique crapterms, which at some collection sizes stop being
> strictly unique and with IDF's help explode your scores.

So you've designed a heuristic that allows you to filter a certain kind of
noise.  It sounds a lot like how people tune length normalization to adapt to
their document collections.  Many tuning techniques are corpus-specific.
Whatever works, works!

Marvin Humphrey
