On Jul 7, 2005, at 3:16 PM, Mark Bennett wrote:
Scanning their paper very quickly, I didn't see a specific mention
(though I might have missed it) of extremely short documents (< 5 words).
The study does not concern itself with different document lengths.
They chose 6 different collections, but it appears that they were
looking for a diversity of authorship and subject matter.
Was there something specific about 1 and 2 word documents you had
in mind?
Could you use a negative document boost on 1 and 2 word docs to solve
your particular problem?
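
To make the boost idea concrete, here is a minimal sketch. It reads
"negative" as a multiplicative boost below 1.0, and it is not tied to any
particular library: the function names, the 2-word cutoff, and the 0.5
penalty are all made up for illustration.

```python
def short_doc_boost(num_terms, cutoff=2, penalty=0.5):
    """Return a multiplicative document boost.

    Hypothetical values: docs with `cutoff` terms or fewer get `penalty`,
    everything else gets a neutral 1.0.
    """
    return penalty if num_terms <= cutoff else 1.0


def boosted_score(raw_score, num_terms):
    """Scale a raw relevance score by the length-based boost."""
    return raw_score * short_doc_boost(num_terms)
```

The penalty would be computed once at index time from the document's
term count, so it costs nothing extra at search time.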
After pondering the clip method a little more, I've become wary of
its effect on title fields. It would work very well on what you
refer to as "main" and I generally call "bodytext", but if it were
set as a default, it would become necessary to weight "title" fields
or short "keywords" fields more heavily.
I think it would be possible, even desirable, to turn on clipping for
bodytext while turning it off for title/keywords. That would require
the implementor to be familiar with scoring formula theory, though.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/